evaluma.benchmark#

Classes#

Benchmark

Container for a normalized model-vs-dataset score matrix.

Module Contents#

class evaluma.benchmark.Benchmark(raw_matrix: pandas.DataFrame, *, norm_ref_low=None, norm_ref_high=None, metric_direction=None, raw_runs=None)#

Container for a normalized model-vs-dataset score matrix.

After construction, the normalized scores are available as scores_. Use the analysis methods below to compute rankings, comparisons, and profiles.

_raw#
_norm_ref_low = None#
_norm_ref_high = None#
_metric_direction = None#
_raw_runs = None#
_normalize(matrix)#
property scores_#

Normalized model × dataset score matrix.
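The normalization rule itself is not documented on this page. As a rough illustration only, the sketch below assumes a per-dataset linear rescaling in which norm_ref_low maps to 0 and norm_ref_high maps to 1. The function name normalize, the example data, and the formula are all assumptions, not the library's confirmed behavior.

```python
import pandas as pd


def normalize(raw, ref_low, ref_high):
    """Hypothetical rescaling: ref_low -> 0 and ref_high -> 1, per dataset column."""
    # DataFrame-minus-Series aligns the Series index with the DataFrame columns.
    return (raw - ref_low) / (ref_high - ref_low)


# Rows are models, columns are datasets (made-up numbers).
raw = pd.DataFrame(
    {"ds_a": [0.60, 0.90], "ds_b": [20.0, 35.0]},
    index=["model_x", "model_y"],
)
ref_low = pd.Series({"ds_a": 0.50, "ds_b": 10.0})
ref_high = pd.Series({"ds_a": 1.00, "ds_b": 40.0})

scores = normalize(raw, ref_low, ref_high)
```

With these inputs every normalized value falls between 0 and 1, which is the range the rope parameter of bayesian_comparison() refers to.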

_new(raw_matrix, raw_runs=None)#
select_models(models)#

Subset the benchmark to the given models.

Parameters:

models – List of model names to retain.

Returns:

New benchmark containing only the selected models.

Return type:

Benchmark

select_datasets(datasets)#

Subset the benchmark to the given datasets.

Parameters:

datasets – List of dataset names to retain.

Returns:

New benchmark containing only the selected datasets.

Return type:

Benchmark

drop_incomplete()#

Remove models that have missing scores for any dataset.

Returns:

New benchmark with incomplete models removed.

Return type:

Benchmark
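The effect of drop_incomplete() can be pictured on a plain pandas matrix. This is a sketch of the equivalent operation, assuming rows are models and columns are datasets. It does not use the Benchmark class, and the data is made up.

```python
import pandas as pd

scores = pd.DataFrame(
    {
        "ds_a": [0.8, 0.7, 0.9],
        "ds_b": [0.6, float("nan"), 0.5],  # model_y has no score on ds_b
    },
    index=["model_x", "model_y", "model_z"],
)

# Keep only the rows (models) that have a score for every column (dataset).
complete = scores.dropna(axis=0, how="any")
```

Here model_y is removed because it lacks a score on ds_b, while the other two models are kept unchanged.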

iqm_ranking(n_bootstrap=1000, random_state=None)#

Compute IQM rankings with stratified bootstrap confidence intervals.

Implements the interquartile mean (IQM) of Agarwal et al. (2021), as used in rliable, on the flattened run × dataset score array. Requires multiple seeds; use aggregate_ranking() for single-run data.

Parameters:
  • n_bootstrap – Number of bootstrap samples used for the 95% CI.

  • random_state – Seed for the random number generator.

Returns:

Result with .table and .plot().

Return type:

IQMResult

Raises:

ValueError – If no seed data is available (_raw_runs is None).
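To make the procedure concrete, here is a minimal NumPy sketch of an IQM with a stratified bootstrap for a single model. It assumes the runs are stored as an array of shape (n_seeds, n_datasets) and that the interval is a percentile interval. The helper names iqm and iqm_with_ci are hypothetical, and the library's actual implementation may differ in detail.

```python
import numpy as np


def iqm(values):
    """Interquartile mean: the mean of the middle 50% of the sorted values."""
    flat = np.sort(np.asarray(values, dtype=float).ravel())
    cut = int(0.25 * flat.size)  # number of values dropped from each end
    return flat[cut:flat.size - cut].mean()


def iqm_with_ci(runs, n_bootstrap=1000, random_state=None):
    """Return (point estimate, CI low, CI high) for one model's runs."""
    runs = np.asarray(runs, dtype=float)
    rng = np.random.default_rng(random_state)
    n_seeds, n_datasets = runs.shape
    cols = np.arange(n_datasets)
    estimates = np.empty(n_bootstrap)
    for b in range(n_bootstrap):
        # Stratified resampling: draw seeds with replacement within each dataset.
        rows = rng.integers(0, n_seeds, size=(n_seeds, n_datasets))
        estimates[b] = iqm(runs[rows, cols])
    low, high = np.percentile(estimates, [2.5, 97.5])
    return iqm(runs), low, high


# Two seeds on four datasets (made-up scores).
runs = np.array([[0.1, 0.2, 0.3, 0.4],
                 [0.5, 0.6, 0.7, 0.8]])
point, low, high = iqm_with_ci(runs, n_bootstrap=200, random_state=0)
```

The resampling is stratified because seeds are redrawn separately for each dataset column, so every bootstrap replicate keeps the same number of runs per dataset.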

aggregate_ranking(agg='trimmed_mean')#

Compute a point-estimate descriptive ranking (no CI).

Works on any benchmark regardless of whether seed data is present.

Note

This ranking is descriptive: a point estimate with no confidence interval. The trimmed-mean variant trims across datasets, not across seeds, so with fewer than about 10 datasets the 25% trim is aggressive (for example, with 5 datasets only 3 contribute). Treat the results as exploratory. For a statistically grounded ranking with uncertainty, use iqm_ranking(), which requires multiple seeds.

Parameters:

agg – Aggregation mode: "trimmed_mean" (default), "mean", or "median".

Returns:

Result with .table and .plot().

Return type:

AggregateResult

Raises:

ValueError – If agg is not a supported mode.
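The arithmetic behind the note on small dataset counts can be checked with a short pure-Python sketch. It assumes the number of values trimmed from each end is the 25% fraction rounded down. The function name trimmed_mean and the scores are illustrative only.

```python
def trimmed_mean(values, trim=0.25):
    """Drop the lowest and highest `trim` fraction of values, then average the rest."""
    ordered = sorted(values)
    cut = int(trim * len(ordered))  # values removed from each end, rounded down
    kept = ordered[cut:len(ordered) - cut]
    return sum(kept) / len(kept), len(kept)


# One model's normalized scores on five datasets (made-up numbers).
mean, n_used = trimmed_mean([0.10, 0.50, 0.60, 0.70, 0.95])
```

With five datasets one value is cut from each end, so only the three middle scores (0.50, 0.60, 0.70) enter the average and the two extreme datasets have no influence on the ranking.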

bayesian_comparison(rope=0.01, reference=None, pairs=None, random_state=None)#

Compute pairwise Bayesian comparisons using a signed-rank test.

Parameters:
  • rope – Region of practical equivalence half-width in normalized score space (0–1). Differences smaller than rope are treated as practically equivalent.

  • reference – If given, only compare each other model against this one.

  • pairs – Explicit list of (model_a, model_b) pairs to test. Overrides reference.

  • random_state – Seed for baycomp’s sampler.

Returns:

Result with .table and .plot().

Return type:

BayesianResult
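To show how the rope parameter shapes the outcome, the sketch below implements the simpler Bayesian sign test rather than the signed-rank variant this method uses. It assumes higher scores are better and a uniform Dirichlet prior. The function name and the data are hypothetical, and baycomp's defaults may differ.

```python
import numpy as np


def bayesian_sign_test(a, b, rope=0.01, n_samples=20000, random_state=None):
    """Return P(a better), P(practically equivalent), P(b better)."""
    diff = np.asarray(b, dtype=float) - np.asarray(a, dtype=float)
    counts = np.array([
        np.sum(diff < -rope),          # datasets where a wins by more than rope
        np.sum(np.abs(diff) <= rope),  # datasets inside the region of equivalence
        np.sum(diff > rope),           # datasets where b wins by more than rope
    ], dtype=float)
    rng = np.random.default_rng(random_state)
    # Posterior over the three outcome probabilities (assumed uniform prior).
    samples = rng.dirichlet(counts + 1.0, size=n_samples)
    winners = samples.argmax(axis=1)
    return np.bincount(winners, minlength=3) / n_samples


# Normalized scores of two models on ten datasets (made-up numbers).
a = np.array([0.50, 0.55, 0.60, 0.52, 0.58, 0.61, 0.49, 0.57, 0.53, 0.56])
b = a + 0.05  # b is better everywhere by more than the rope
p_a, p_equiv, p_b = bayesian_sign_test(a, b, rope=0.01, random_state=0)
```

Widening rope moves more datasets into the middle bucket, which shifts posterior mass toward the "practically equivalent" outcome.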

frequentist_comparison(reference=None, alpha=0.05)#

Compute frequentist model comparisons.

Runs a Friedman omnibus test first, then one of two post-hoc procedures. All-pairs mode applies the Nemenyi test, following the Demšar (2006) / autorank workflow. Reference mode is an evaluma extension: pairwise Wilcoxon signed-rank tests against a named baseline, with Holm correction.

Parameters:
  • reference – If given, only compare each other model against this one using Wilcoxon + Holm. None triggers all-pairs Nemenyi mode.

  • alpha – Significance level for the significant column (default 0.05).

Returns:

Result with .table and .plot().

Return type:

FrequentistResult

Raises:

ValueError – If fewer than 5 datasets are present.

References

Demšar, J. (2006). Statistical Comparisons of Classifiers over Multiple Data Sets. JMLR, 7, 1–30.
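The Holm step used in reference mode can be sketched in plain Python. The p-values below are illustrative inputs rather than output from a real Wilcoxon test, and holm_adjust is a hypothetical helper name, not part of evaluma.

```python
def holm_adjust(p_values):
    """Holm step-down adjustment; returns adjusted p-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The smallest p-value is multiplied by m, the next by m - 1, and so on.
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Enforce monotonicity: an adjusted p-value never drops below an earlier one.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


# Raw p-values for three models compared against one reference (made-up numbers).
raw_p = [0.01, 0.04, 0.03]
adj_p = holm_adjust(raw_p)
significant = [p < 0.05 for p in adj_p]
```

At alpha = 0.05 only the first comparison stays significant after correction, although all three raw p-values are below 0.05.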

performance_profiles()#

Compute Dolan-Moré performance profiles.

Profiles are computed on the raw (un-normalized) score matrix. All raw values must be strictly positive; see Raises.

Returns:

Result with .table and .plot().

Return type:

ProfileResult

Raises:

ValueError – If any raw score is zero or negative.
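As an illustration of why strictly positive values are required, here is a small NumPy sketch of a performance profile. Dolan and Moré define the ratio for costs, where lower is better. This sketch assumes higher-is-better scores and therefore uses best score divided by the model's score, which may not match the library's convention. The function name and the data are hypothetical.

```python
import numpy as np


def performance_profile(raw, taus):
    """raw: (n_models, n_datasets) scores. Returns an array of shape (len(taus), n_models)."""
    raw = np.asarray(raw, dtype=float)
    if np.any(raw <= 0):
        # Ratios are undefined or change sign for zero or negative scores.
        raise ValueError("all raw scores must be strictly positive")
    # Ratio 1.0 means best on that dataset; larger means further behind the best.
    ratios = raw.max(axis=0, keepdims=True) / raw
    # For each tau: the fraction of datasets on which a model is within a factor tau.
    return np.array([(ratios <= tau).mean(axis=1) for tau in taus])


# Two models on three datasets (made-up raw scores).
raw = [[8.0, 4.0, 6.0],
       [4.0, 4.0, 3.0]]
profile = performance_profile(raw, taus=[1.0, 2.0])
```

The first model is best or tied on every dataset, so its profile is already 1.0 at tau = 1. The second model reaches 1.0 only at tau = 2.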