evaluma.benchmark#

Classes#

Benchmark

Container for a normalized model-vs-dataset score matrix.

Module Contents#

class evaluma.benchmark.Benchmark(raw_matrix: pandas.DataFrame, *, norm_ref_low=None, norm_ref_high=None, metric_direction=None, raw_runs=None)#

Container for a normalized model-vs-dataset score matrix.

After construction the normalized scores are available as scores_. Use the analysis methods to compute rankings, comparisons, and profiles.

_raw#

_norm_ref_low = None#

_norm_ref_high = None#

_metric_direction = None#

_raw_runs = None#

_normalize(matrix)#

property scores_#: Normalized model × dataset score matrix.
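The normalization rule itself is not spelled out on this page. As a rough sketch only, one plausible reading of norm_ref_low and norm_ref_high is a per-dataset min-max rescaling between two reference values. The helper normalize_min_max, the sample data, and the choice to ignore metric_direction are all assumptions made for illustration, not evaluma's implementation.

```python
import pandas as pd


def normalize_min_max(raw, ref_low, ref_high):
    """Hypothetical helper: rescale each dataset column to (x - low) / (high - low).

    raw is a model x dataset DataFrame; ref_low and ref_high are Series
    indexed by dataset name, so pandas aligns them with the columns.
    """
    return (raw - ref_low) / (ref_high - ref_low)


# Made-up raw scores: two models (rows) on two datasets (columns).
raw = pd.DataFrame(
    {"d1": [50.0, 75.0], "d2": [0.2, 0.6]},
    index=["m1", "m2"],
)
ref_low = pd.Series({"d1": 0.0, "d2": 0.2})
ref_high = pd.Series({"d1": 100.0, "d2": 1.0})

normalized = normalize_min_max(raw, ref_low, ref_high)
```

Under this assumed rule, a score equal to the low reference maps to 0 and a score equal to the high reference maps to 1.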

_new(raw_matrix, raw_runs=None)#

select_models(models)#

Subset the benchmark to the given models.

Parameters: models – List of model names to retain.
Returns: New benchmark containing only the selected models.
Return type: Benchmark

select_datasets(datasets)#

Subset the benchmark to the given datasets.

Parameters: datasets – List of dataset names to retain.
Returns: New benchmark containing only the selected datasets.
Return type: Benchmark

drop_incomplete()#

Remove models that have missing scores for any dataset.

Returns: New benchmark with incomplete models removed.
Return type: Benchmark
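The three subsetting methods above have close analogues in plain pandas. The sketch below shows those analogues on a made-up model × dataset frame, assuming models are rows and datasets are columns. It does not call evaluma, and the model and dataset names are invented for illustration.

```python
import numpy as np
import pandas as pd

# Made-up raw matrix: rows are models, columns are datasets.
raw = pd.DataFrame(
    {"d1": [0.9, 0.8, 0.7], "d2": [0.5, np.nan, 0.6]},
    index=["m1", "m2", "m3"],
)

# Analogue of select_models(["m1", "m3"]): keep only the listed rows.
subset_models = raw.loc[["m1", "m3"]]

# Analogue of select_datasets(["d1"]): keep only the listed columns.
subset_datasets = raw[["d1"]]

# Analogue of drop_incomplete(): drop any model with a missing score.
complete = raw.dropna(axis=0, how="any")
```

Here m2 is dropped from complete because its d2 score is missing. Each operation returns a new frame, matching the documented behavior of returning a new benchmark.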

iqm_ranking(n_bootstrap=1000, random_state=None)#

Compute interquartile-mean (IQM) rankings with stratified bootstrap confidence intervals.

Implements the IQM of Agarwal et al. (2021, rliable) on the flat run × dataset score array. Requires multiple seeds; for single-run data, use aggregate_ranking().

Parameters:

n_bootstrap – Number of bootstrap samples for the 95% CI.
random_state – Seed for the random number generator.

Returns:

Result with .table and .plot().

Return type:

IQMResult

Raises:

ValueError – If no seed data is available (_raw_runs is None).
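To make the procedure concrete, here is a minimal NumPy sketch of an IQM with a stratified bootstrap for a single model. It assumes a runs × datasets array, a 25% trim per tail that rounds down, and a percentile interval. The function names iqm and stratified_bootstrap_ci are hypothetical and this is not evaluma's code.

```python
import numpy as np


def iqm(scores):
    """Mean of the middle 50% of all scores, pooled over runs and datasets."""
    flat = np.sort(np.asarray(scores, dtype=float).ravel())
    cut = int(0.25 * flat.size)  # number of values trimmed from each tail
    return float(flat[cut:flat.size - cut].mean())


def stratified_bootstrap_ci(runs, n_bootstrap=1000, random_state=None):
    """Resample runs independently within each dataset column, then take IQM."""
    rng = np.random.default_rng(random_state)
    runs = np.asarray(runs, dtype=float)
    n_runs, n_datasets = runs.shape
    estimates = np.empty(n_bootstrap)
    for b in range(n_bootstrap):
        # One set of run indices per dataset keeps the strata separate.
        idx = rng.integers(0, n_runs, size=(n_runs, n_datasets))
        resampled = np.take_along_axis(runs, idx, axis=0)
        estimates[b] = iqm(resampled)
    low, high = np.percentile(estimates, [2.5, 97.5])
    return iqm(runs), float(low), float(high)


# Made-up normalized scores: 4 runs (rows) on 2 datasets (columns).
runs = np.array([[0.1, 0.5], [0.2, 0.6], [0.3, 0.7], [0.4, 0.8]])
point, ci_low, ci_high = stratified_bootstrap_ci(runs, n_bootstrap=200, random_state=0)
```

With a single run per dataset every resample would reproduce the same array, which is why this kind of interval needs multiple seeds.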

aggregate_ranking(agg='trimmed_mean')#

Compute a point-estimate descriptive ranking (no CI).

Works on any benchmark regardless of whether seed data is present.

Note

The trimmed-mean variant trims across datasets, not across seeds. With fewer than about 10 datasets the 25% trim is aggressive (for example, with 5 datasets only 3 contribute), so treat results as exploratory. For a statistically grounded ranking with uncertainty, use iqm_ranking(), which requires multiple seeds.

Parameters: agg – Aggregation mode: "trimmed_mean" (default), "mean", or "median".
Returns: Result with .table and .plot().
Return type: AggregateResult
Raises: ValueError – If agg is not a supported mode.
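The effect of the 25% trim on a small benchmark can be checked by hand. The sketch below computes a trimmed mean across five made-up dataset scores for one model, assuming the number of values trimmed per tail is rounded down (which is consistent with the "5 datasets, 3 contribute" figure in the note). The helper trimmed_mean is hypothetical.

```python
import numpy as np


def trimmed_mean(scores, proportion=0.25):
    """Drop the lowest and highest `proportion` of values, then average."""
    ordered = np.sort(np.asarray(scores, dtype=float))
    cut = int(proportion * ordered.size)  # rounds down: 5 datasets -> 1 per tail
    kept = ordered[cut:ordered.size - cut]
    return float(kept.mean()), int(kept.size)


# Made-up normalized scores for one model on five datasets.
scores = [0.9, 0.1, 0.5, 0.6, 0.4]
value, n_contributing = trimmed_mean(scores)
plain_mean = float(np.mean(scores))
```

Here the best and worst datasets (0.9 and 0.1) are discarded, so only three of the five scores shape the estimate.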

bayesian_comparison(rope=0.01, reference=None, pairs=None, random_state=None)#

Compute pairwise Bayesian comparisons via a signed-rank test.

Parameters:

rope – Region of practical equivalence half-width in normalized score space (0–1). Differences smaller than rope are treated as practically equivalent.
reference – If given, only compare each other model against this one.
pairs – Explicit list of (model_a, model_b) pairs to test. Overrides reference.
random_state – Seed for baycomp’s sampler.

Returns:

Result with .table and .plot().

Return type:

BayesianResult
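The role of rope is easier to see on raw differences. The sketch below only partitions per-dataset score differences into three regions around zero. It is not the Bayesian signed-rank test and does not call baycomp; the helper rope_counts, the sample scores, and the treatment of differences exactly equal to rope as equivalent are assumptions for illustration.

```python
import numpy as np


def rope_counts(scores_a, scores_b, rope=0.01):
    """Count datasets where A wins, B wins, or the gap falls inside the ROPE."""
    diff = np.asarray(scores_a, dtype=float) - np.asarray(scores_b, dtype=float)
    return {
        "a_better": int((diff > rope).sum()),
        "equivalent": int((np.abs(diff) <= rope).sum()),
        "b_better": int((diff < -rope).sum()),
    }


# Made-up normalized scores for two models on four datasets.
model_a = [0.80, 0.70, 0.655, 0.90]
model_b = [0.70, 0.705, 0.65, 0.95]
counts = rope_counts(model_a, model_b, rope=0.01)
```

Two of the four differences are about 0.005 in magnitude, so with rope=0.01 they count as practically equivalent rather than as wins for either model.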

frequentist_comparison(reference=None, alpha=0.05)#

Compute frequentist model comparisons.

A Friedman omnibus test runs first. All-pairs mode then applies the Nemenyi post-hoc test, following the Demšar (2006) / autorank workflow. Reference mode is an evaluma extension: pairwise Wilcoxon signed-rank tests against a named baseline, with Holm correction.

Parameters:

reference – If given, only compare each other model against this one using Wilcoxon + Holm. None triggers all-pairs Nemenyi mode.
alpha – Significance level for the significant column (default 0.05).

Returns:

Result with .table and .plot().

Return type:

FrequentistResult

Raises:

ValueError – If fewer than 5 datasets are present.
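The Holm step used in reference mode can be sketched in a few lines of plain Python. The p-values below are made up for illustration and stand in for the per-model Wilcoxon results; holm_adjust is a hypothetical helper, and whether evaluma compares adjusted p-values to alpha with a strict inequality is an assumption.

```python
def holm_adjust(p_values):
    """Holm step-down adjustment; returns adjusted p-values in input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        # The smallest p-value is multiplied by m, the next by m - 1, and so on.
        candidate = min(1.0, (m - rank) * p_values[i])
        # Enforce monotonicity so adjusted values never decrease along the order.
        running_max = max(running_max, candidate)
        adjusted[i] = running_max
    return adjusted


# Made-up raw p-values for three models compared against one baseline.
p_values = [0.01, 0.04, 0.03]
alpha = 0.05
adjusted = holm_adjust(p_values)
significant = [p < alpha for p in adjusted]
```

All three raw p-values are below 0.05, but after the correction only the first comparison remains significant at that level.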

References

Demšar, J. (2006). Statistical Comparisons of Classifiers over Multiple Data Sets. JMLR, 7, 1–30.

performance_profiles()#

Compute Dolan-Moré performance profiles.

Profiles are computed on the raw (un-normalized) score matrix. All raw values must be strictly positive; see Raises.

Returns: Result with .table and .plot().
Return type: ProfileResult
Raises: ValueError – If any raw score is zero or negative.
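For orientation, here is a minimal sketch of the classic Dolan-Moré construction in its original cost form, where lower values are better and each value is divided by the best value on that dataset. How evaluma handles higher-is-better metrics is not stated on this page, so the orientation, the helper performance_profile, and the sample data are assumptions.

```python
import numpy as np


def performance_profile(costs, taus):
    """Fraction of datasets on which each model is within a factor tau of the best.

    costs is a model x dataset array of strictly positive values (lower is better).
    Returns an array of shape (n_models, len(taus)).
    """
    costs = np.asarray(costs, dtype=float)
    if np.any(costs <= 0):
        # Ratios are undefined for zero or negative values.
        raise ValueError("all values must be strictly positive")
    best = costs.min(axis=0)   # best value per dataset
    ratios = costs / best      # performance ratio, always >= 1
    return np.array([(ratios <= tau).mean(axis=1) for tau in taus]).T


# Made-up costs: 2 models (rows) on 3 datasets (columns).
costs = [[1.0, 4.0, 3.0], [2.0, 2.0, 6.0]]
profile = performance_profile(costs, taus=[1.0, 2.0])
```

The value at tau = 1 is the share of datasets on which a model is the best, and the curve rises toward 1 as tau grows. The ratio is only meaningful for positive values, which is consistent with the strictly-positive requirement documented above.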