evaluma

Submodules

Attributes

Classes

Benchmark

Container for a normalized model-vs-dataset score matrix.

FrequentistResult

Result of frequentist_comparison().

Functions

load_df(df, *[, model, dataset, metric, score, seed, ...])

Load a DataFrame and return a ready-to-use Benchmark object.

load_csv(path, *[, model, dataset, metric, score, ...])

Load a benchmark CSV file and return a ready-to-use Benchmark object.

_resolve_metric_type_bounds(metric_type_bounds, ...)

Resolve per-dataset normalization bounds and directions from the metric registry.

Package Contents

evaluma.__version__: str

class evaluma.Benchmark(raw_matrix: pandas.DataFrame, *, norm_ref_low=None, norm_ref_high=None, metric_direction=None, raw_runs=None)

Container for a normalized model-vs-dataset score matrix.

After construction the normalized scores are available as scores_. Use the analysis methods to compute rankings, comparisons, and profiles.

_raw
_norm_ref_low = None
_norm_ref_high = None
_metric_direction = None
_raw_runs = None
_normalize(matrix)
property scores_

Normalized model × dataset score matrix.

_new(raw_matrix, raw_runs=None)
select_models(models)

Subset the benchmark to the given models.

Parameters:

models – List of model names to retain.

Returns:

New benchmark containing only the selected models.

Return type:

Benchmark

select_datasets(datasets)

Subset the benchmark to the given datasets.

Parameters:

datasets – List of dataset names to retain.

Returns:

New benchmark containing only the selected datasets.

Return type:

Benchmark

drop_incomplete()

Remove models that have missing scores for any dataset.

Returns:

New benchmark with incomplete models removed.

Return type:

Benchmark

iqm_ranking(n_bootstrap=1000, random_state=None)

Compute interquartile mean (IQM) rankings with stratified bootstrap confidence intervals.

Implements the rliable IQM (Agarwal et al., 2021) on the flattened run × dataset score array. Requires multiple seeds; use aggregate_ranking() for single-run data.

Parameters:
  • n_bootstrap – Number of bootstrap samples for the 95% CI.

  • random_state – Seed for the random number generator.

Returns:

Result with .table and .plot().

Return type:

IQMResult

Raises:

ValueError – If no seed data is available (_raw_runs is None).
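
To make the IQM and the stratified bootstrap concrete, the following is a minimal standard-library sketch, not evaluma's implementation. It assumes the IQM drops floor(0.25 · n) values at each end of the flattened scores and that "stratified" means resampling runs within each dataset; the names iqm, stratified_bootstrap_ci, and the example scores are hypothetical.

```python
import random
from statistics import mean


def iqm(scores):
    """Interquartile mean: drop the lowest and highest 25% of values, average the rest."""
    ordered = sorted(scores)
    cut = int(0.25 * len(ordered))  # number of values dropped at each end
    return mean(ordered[cut:len(ordered) - cut])


def stratified_bootstrap_ci(runs_by_dataset, n_bootstrap=1000, random_state=None):
    """Percentile 95% interval for the IQM, resampling runs within each dataset."""
    rng = random.Random(random_state)
    estimates = []
    for _ in range(n_bootstrap):
        flat = []
        for runs in runs_by_dataset.values():
            # Stratified: every dataset contributes as many resampled runs as it has.
            flat.extend(rng.choices(runs, k=len(runs)))
        estimates.append(iqm(flat))
    estimates.sort()
    low = estimates[int(0.025 * n_bootstrap)]
    high = estimates[int(0.975 * n_bootstrap) - 1]
    return low, high


# Hypothetical normalized scores for one model: 2 datasets x 4 seeds.
runs = {"ds_a": [0.2, 0.4, 0.5, 0.6], "ds_b": [0.7, 0.8, 0.9, 1.0]}
flat_scores = [s for r in runs.values() for s in r]
point = iqm(flat_scores)  # mean of the middle four values
ci_low, ci_high = stratified_bootstrap_ci(runs, n_bootstrap=200, random_state=0)
```

With a single run per dataset every resample is identical, so the interval collapses to a point; this is one way to see why the method requires multiple seeds.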

aggregate_ranking(agg='trimmed_mean')

Compute a point-estimate descriptive ranking (no CI).

Works on any benchmark regardless of whether seed data is present.

Note

This ranking is descriptive only. The trimmed-mean variant trims across datasets, not across seeds, so with fewer than ~10 datasets the 25% trim is aggressive (e.g. with 5 datasets, only 3 contribute). Treat results as exploratory. For a statistically grounded ranking with uncertainty estimates, use iqm_ranking(), which requires multiple seeds.

Parameters:

agg – Aggregation mode: "trimmed_mean" (default), "mean", or "median".

Returns:

Result with .table and .plot().

Return type:

AggregateResult

Raises:

ValueError – If agg is not a supported mode.
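
The effect of the trim described in the note can be checked with a short sketch. It assumes the trim removes floor(0.25 · n) scores from each end of the sorted per-dataset scores, which matches the "5 datasets, 3 contribute" example; trimmed_mean and the scores are hypothetical and not part of evaluma.

```python
from statistics import mean


def trimmed_mean(scores, trim=0.25):
    """Average after dropping floor(trim * n) values from each end.

    Returns the trimmed mean and the number of values that contributed.
    """
    ordered = sorted(scores)
    cut = int(trim * len(ordered))
    kept = ordered[cut:len(ordered) - cut]
    return mean(kept), len(kept)


# Hypothetical normalized scores of one model on five datasets.
value, n_kept = trimmed_mean([0.10, 0.55, 0.60, 0.65, 0.95])
# The lowest and highest dataset scores are discarded, so n_kept is 3.
```

With so few values left, a single dataset can shift the aggregate noticeably, which is why the note recommends treating the result as exploratory.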

bayesian_comparison(rope=0.01, reference=None, pairs=None, random_state=None)

Compute pairwise Bayesian comparisons via a signed-rank test.

Parameters:
  • rope – Region of practical equivalence half-width in normalized score space (0–1). Differences smaller than rope are treated as practically equivalent.

  • reference – If given, only compare each other model against this one.

  • pairs – Explicit list of (model_a, model_b) pairs to test. Overrides reference.

  • random_state – Seed for baycomp’s sampler.

Returns:

Result with .table and .plot().

Return type:

BayesianResult
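
As a rough illustration of what the rope half-width means, the sketch below sorts individual score differences into three regions. It is not the Bayesian signed-rank test itself, which works with posterior probabilities rather than hard labels; rope_region and the region names are hypothetical.

```python
def rope_region(diff, rope=0.01):
    """Classify a normalized score difference (model_a - model_b) relative to the ROPE."""
    if abs(diff) < rope:
        return "equivalent"  # inside the region of practical equivalence
    return "a_better" if diff > 0 else "b_better"


# Hypothetical per-dataset differences in normalized score space.
labels = [rope_region(d) for d in (0.005, -0.002, 0.03, -0.05)]
```

A wider rope moves more differences into the "equivalent" region, so the value should reflect what counts as a negligible gap on the normalized 0–1 scale.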

frequentist_comparison(reference=None, alpha=0.05)

Compute frequentist model comparisons.

Runs a Friedman omnibus test first, then one of two post-hoc procedures. All-pairs mode applies the Nemenyi test, following the Demšar (2006) / autorank Friedman + Nemenyi workflow. Reference mode is an evaluma extension: pairwise Wilcoxon signed-rank tests against a named baseline, with Holm correction.

Parameters:
  • reference – If given, only compare each other model against this one using Wilcoxon + Holm. None triggers all-pairs Nemenyi mode.

  • alpha – Significance level for the significant column (default 0.05).

Returns:

Result with .table and .plot().

Return type:

FrequentistResult

Raises:

ValueError – If fewer than 5 datasets are present.

References

Demšar, J. (2006). Statistical Comparisons of Classifiers over Multiple Data Sets. JMLR, 7, 1–30.
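
The Holm correction used in reference mode can be sketched in a few lines. This is the standard step-down procedure written from its textbook definition, not code taken from evaluma; holm_adjust, the model names, and the p-values are hypothetical.

```python
def holm_adjust(p_values):
    """Return Holm step-down adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The k-th smallest p-value (k = rank + 1) is scaled by m - k + 1, capped at 1.
        candidate = min(1.0, (m - rank) * p_values[idx])
        running_max = max(running_max, candidate)  # keep adjusted values monotone
        adjusted[idx] = running_max
    return adjusted


# Hypothetical raw Wilcoxon p-values of three models against a baseline.
raw = {"model_b": 0.01, "model_c": 0.04, "model_d": 0.03}
adjusted = dict(zip(raw, holm_adjust(list(raw.values()))))
significant = {name: p < 0.05 for name, p in adjusted.items()}
```

Here only model_b stays below alpha = 0.05 after correction, although all three raw p-values were below it.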

performance_profiles()

Compute Dolan-Moré performance profiles.

Profiles are computed on the raw (un-normalized) score matrix. All raw values must be strictly positive; see Raises.

Returns:

Result with .table and .plot().

Return type:

ProfileResult

Raises:

ValueError – If any raw score is zero or negative.
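
The positivity requirement follows from how performance profiles are built: each value is divided by the best value on the same problem. The sketch below uses the original Dolan-Moré formulation for costs, where lower is better; how evaluma maps higher-is-better scores onto ratios is not documented here, so treat performance_profile and its inputs as hypothetical.

```python
def performance_profile(costs, taus):
    """Fraction of problems on which each model is within a factor tau of the best.

    costs: {model: [cost per problem]}, lower is better, all values > 0.
    """
    models = list(costs)
    n_problems = len(costs[models[0]])
    best = [min(costs[m][p] for m in models) for p in range(n_problems)]
    profile = {}
    for m in models:
        # Performance ratio: this model's cost relative to the best cost per problem.
        ratios = [costs[m][p] / best[p] for p in range(n_problems)]
        profile[m] = [sum(r <= tau for r in ratios) / n_problems for tau in taus]
    return profile


# Hypothetical costs of two models on three problems.
costs = {"model_a": [1.0, 2.0, 4.0], "model_b": [2.0, 2.0, 3.0]}
profile = performance_profile(costs, taus=[1.0, 1.5, 2.0])
```

A zero would make a ratio undefined and a negative value would flip its meaning, so ratios are only interpretable when every entry is strictly positive.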

class evaluma.FrequentistResult(table: pandas.DataFrame, avg_ranks: pandas.Series, friedman_statistic: float, friedman_p_value: float, reference=None, alpha=0.05, cd=None)

Result of frequentist_comparison().

table
avg_ranks
friedman_statistic
friedman_p_value
reference = None
alpha = 0.05
cd = None
plot(title=None)

Render the comparison result.

In all-pairs mode renders a CD diagram (Demšar 2006) with the Nemenyi critical difference bracket. In reference mode renders a horizontal bar chart of Holm-corrected p-values.

Parameters:

title – Optional figure title.

Returns:

The rendered figure.

Return type:

matplotlib.figure.Figure

Example

>>> result = bench.frequentist_comparison()
>>> fig = result.plot()
evaluma.load_df(df, *, model='model', dataset='dataset', metric='metric', score='score', seed=None, metric_type_bounds=None, norm_ref_low=None, norm_ref_high=None, metric_direction=None, drop_incomplete=False)

Load a DataFrame and return a ready-to-use Benchmark object.

Parameters:
  • df – A pandas DataFrame in long format: one row per model/dataset pair, or one row per model/dataset/seed combination when seed is given. To load from a CSV file, use evaluma.load_csv() instead.

  • model – Column name for the model identifier.

  • dataset – Column name for the dataset identifier.

  • metric – Column name for the metric identifier.

  • score – Column name for the score values.

  • seed – Column name for the random seed. When provided, all seed rows are preserved and a seed column is included in the loaded DataFrame.

  • metric_type_bounds – Dict mapping metric names to (low, high) bound tuples. high may be a model name string, resolved per-dataset to that model’s score. When provided, norm_ref_low and norm_ref_high must be None. Metric direction is inferred from the built-in registry; unknown metrics raise ValueError. Bounded metrics not listed here (e.g. accuracy, iou, f1) use their natural [0, 1] bounds automatically. Unbounded metrics (rmse, mae, mse) must be listed; omitting them raises ValueError.

  • norm_ref_low – Lower normalization reference: a scalar, a model name, or a per-dataset dict. If None, the per-dataset minimum is used and a UserWarning is emitted. Cannot be combined with metric_type_bounds.

  • norm_ref_high – Upper normalization reference, in the same format as norm_ref_low. If None, the per-dataset maximum is used. Cannot be combined with metric_type_bounds.

  • metric_direction – Dict mapping dataset names to "min" or "max". When used with metric_type_bounds, these entries take precedence over the registry-inferred direction. Without metric_type_bounds, datasets mapped to "min" are negated before normalization so that higher is always better.

  • drop_incomplete – If True, silently drop models with missing scores instead of raising.

Returns:

Normalized benchmark ready for analysis.

Return type:

Benchmark

Raises:
  • TypeError – If df is not a pandas DataFrame.

  • ValueError – If metric_type_bounds is provided together with norm_ref_low or norm_ref_high.

  • ValueError – If the data contains more than one metric per (model, dataset) pair, or if the score matrix is incomplete and drop_incomplete is False.

  • ValueError – If a metric referenced by a dataset is not in the registry and not covered by metric_type_bounds.

  • ValueError – If a regression metric (rmse, mae, mse) is present but no upper bound is specified in metric_type_bounds.
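
The role of the normalization references can be illustrated with a small sketch for a single dataset. It assumes a min-max mapping (score - low) / (high - low), which this page does not state explicitly, and it only handles scalar references; normalize and the scores are hypothetical.

```python
def normalize(scores, low=None, high=None, direction="max"):
    """Min-max normalize one dataset's scores so that higher is better.

    scores: {model: raw score}. low/high default to the per-dataset min/max.
    Assumption: explicit references are given in the same space as the
    (possibly negated) scores.
    """
    if direction == "min":
        # Negate lower-is-better metrics so that higher is always better.
        scores = {m: -s for m, s in scores.items()}
    low = min(scores.values()) if low is None else low
    high = max(scores.values()) if high is None else high
    return {m: (s - low) / (high - low) for m, s in scores.items()}


# Accuracy with explicit references: 0.5 as the floor, 1.0 as the ceiling.
acc = normalize({"a": 0.6, "b": 0.8, "c": 0.9}, low=0.5, high=1.0)

# RMSE with defaults: the best (lowest) error maps to 1, the worst to 0.
rmse = normalize({"a": 2.0, "b": 4.0, "c": 3.0}, direction="min")
```

With the default references the worst and best models on each dataset always land on 0 and 1, so the normalized values depend on which models are in the benchmark. This may be why omitting norm_ref_low triggers a warning.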

evaluma.load_csv(path, *, model='model', dataset='dataset', metric='metric', score='score', seed=None, metric_type_bounds=None, norm_ref_low=None, norm_ref_high=None, metric_direction=None, drop_incomplete=False)

Load a benchmark CSV file and return a ready-to-use Benchmark object.

Parameters:
  • path – Path to the CSV file.

  • model – Column name for the model identifier.

  • dataset – Column name for the dataset identifier.

  • metric – Column name for the metric identifier.

  • score – Column name for the score values.

  • seed – Column name for the random seed.

  • metric_type_bounds – See evaluma.load_df().

  • norm_ref_low – See evaluma.load_df().

  • norm_ref_high – See evaluma.load_df().

  • metric_direction – See evaluma.load_df().

  • drop_incomplete – See evaluma.load_df().

Returns:

Normalized benchmark ready for analysis.

Return type:

Benchmark

evaluma._resolve_metric_type_bounds(metric_type_bounds, dataset_metric_map, raw_matrix, metric_direction_override)

Resolve per-dataset normalization bounds and directions from the metric registry.

For each dataset, consults metric_type_bounds first, then falls back to the built-in registry for metrics with natural bounds. Raises if an unbounded metric (rmse, mae, mse) has no entry in metric_type_bounds.

Parameters:
  • metric_type_bounds – Dict mapping metric names → (low, high) tuples.

  • dataset_metric_map – Dict mapping dataset names → metric name strings.

  • raw_matrix – Model × dataset score DataFrame (used to resolve model-name bounds).

  • metric_direction_override – Optional dict mapping dataset names → "min"/"max"; these entries override registry-inferred directions.

Returns:

(norm_ref_low, norm_ref_high, metric_direction) where the first two are pd.Series keyed by dataset and the last is a dict (or None if empty).

Return type:

tuple
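
The lookup order described above can be sketched as follows. The REGISTRY dict is a hypothetical stand-in for the built-in metric registry, the dataset names are invented, plain dicts replace the pd.Series return values, and resolving a model name as the high bound is omitted for brevity.

```python
# Hypothetical stand-in for the built-in registry: metric -> (natural bounds, direction).
REGISTRY = {
    "accuracy": ((0.0, 1.0), "max"),
    "f1": ((0.0, 1.0), "max"),
    "iou": ((0.0, 1.0), "max"),
    "rmse": (None, "min"),  # unbounded: bounds must come from metric_type_bounds
    "mae": (None, "min"),
    "mse": (None, "min"),
}


def resolve(dataset_metric_map, metric_type_bounds, direction_override=None):
    """Return per-dataset (low, high, direction) dicts."""
    direction_override = direction_override or {}
    low, high, direction = {}, {}, {}
    for dataset, metric in dataset_metric_map.items():
        if metric not in REGISTRY:
            raise ValueError(f"unknown metric: {metric}")
        natural_bounds, registry_direction = REGISTRY[metric]
        # User-supplied bounds win; otherwise fall back to the natural bounds.
        bounds = metric_type_bounds.get(metric, natural_bounds)
        if bounds is None:
            raise ValueError(f"{metric} is unbounded; list it in metric_type_bounds")
        low[dataset], high[dataset] = bounds
        # Per-dataset overrides take precedence over the registry direction.
        direction[dataset] = direction_override.get(dataset, registry_direction)
    return low, high, direction


low, high, direction = resolve(
    {"cifar": "accuracy", "housing": "rmse"},
    metric_type_bounds={"rmse": (0.0, 10.0)},
)
```

Calling resolve with an rmse dataset and an empty metric_type_bounds raises ValueError, mirroring the documented behaviour for unbounded metrics.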