evaluma.benchmark#
Classes#
Benchmark | Container for a normalized model-vs-dataset score matrix.
Module Contents#
- class evaluma.benchmark.Benchmark(raw_matrix: pandas.DataFrame, *, norm_ref_low=None, norm_ref_high=None, metric_direction=None, raw_runs=None)#
Container for a normalized model-vs-dataset score matrix.
After construction, the normalized scores are available as scores_. Use the analysis methods to compute rankings, comparisons, and profiles.
- _raw#
- _norm_ref_low = None#
- _norm_ref_high = None#
- _metric_direction = None#
- _raw_runs = None#
- _normalize(matrix)#
- property scores_#
Normalized model × dataset score matrix.
- _new(raw_matrix, raw_runs=None)#
- select_models(models)#
Subset the benchmark to the given models.
- Parameters:
models – List of model names to retain.
- Returns:
New benchmark containing only the selected models.
- Return type:
- select_datasets(datasets)#
Subset the benchmark to the given datasets.
- Parameters:
datasets – List of dataset names to retain.
- Returns:
New benchmark containing only the selected datasets.
- Return type:
- drop_incomplete()#
Remove models that have missing scores for any dataset.
- Returns:
New benchmark with incomplete models removed.
- Return type:
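The three subsetting methods above can be pictured as plain pandas operations on the score matrix. The following is only a sketch of that idea, not evaluma's implementation: it assumes rows are models and columns are datasets, and the names `model_x`, `ds_a`, and so on are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical score matrix: rows are models, columns are datasets (assumed layout).
scores = pd.DataFrame(
    {"ds_a": [0.9, 0.7, 0.8], "ds_b": [0.6, np.nan, 0.5]},
    index=["model_x", "model_y", "model_z"],
)

# Analogue of select_models: keep only the named rows.
two_models = scores.loc[["model_x", "model_z"]]

# Analogue of select_datasets: keep only the named columns.
one_dataset = scores[["ds_a"]]

# Analogue of drop_incomplete: drop every model with a missing score on any dataset.
complete = scores.dropna(axis=0, how="any")
```

In this sketch `model_y` is the only row removed by the last step, because it has no score for `ds_b`.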
- iqm_ranking(n_bootstrap=1000, random_state=None)#
Compute IQM rankings with stratified bootstrap confidence intervals.
Implements the interquartile mean (IQM) of Agarwal et al. 2021 (rliable) on the flat run × dataset score array. Requires multiple seeds; use aggregate_ranking() for single-run data.
- Parameters:
n_bootstrap – Number of bootstrap samples for the 95 % CI.
random_state – Seed for the random number generator.
- Returns:
Result with .table and .plot().
- Return type:
- Raises:
ValueError – If no seed data is available (_raw_runs is None).
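As a rough illustration of what an IQM with a stratified bootstrap interval involves, here is a minimal NumPy sketch for a single model. It assumes a `(n_runs, n_datasets)` array of normalized scores, a percentile interval, and resampling of runs independently within each dataset; the function names are hypothetical and the details may differ from what evaluma or rliable actually do.

```python
import numpy as np

def iqm(values):
    """Interquartile mean: drop the lowest and highest 25%, average the rest."""
    flat = np.sort(np.asarray(values, dtype=float).ravel())
    cut = int(0.25 * flat.size)  # number of values trimmed from each end
    return float(flat[cut:flat.size - cut].mean())

def iqm_bootstrap_ci(runs, n_bootstrap=1000, random_state=None):
    """Point estimate and 95% percentile CI for one model.

    runs: array of shape (n_runs, n_datasets).
    """
    rng = np.random.default_rng(random_state)
    runs = np.asarray(runs, dtype=float)
    n_runs, n_datasets = runs.shape
    cols = np.arange(n_datasets)
    estimates = np.empty(n_bootstrap)
    for b in range(n_bootstrap):
        # Stratified resampling: draw run indices separately for each dataset.
        idx = rng.integers(0, n_runs, size=(n_runs, n_datasets))
        estimates[b] = iqm(runs[idx, cols])
    low, high = np.percentile(estimates, [2.5, 97.5])
    return iqm(runs), float(low), float(high)
```

With a single run per dataset every resample is identical, so the interval collapses to a point, which is why the method requires multiple seeds.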
- aggregate_ranking(agg='trimmed_mean')#
Compute a point-estimate descriptive ranking (no CI).
Works on any benchmark regardless of whether seed data is present.
Note
This is a descriptive point estimate only (no CI). The trimmed-mean variant trims across datasets, not across seeds; with fewer than ~10 datasets the 25% trim is aggressive (e.g. 5 datasets → only 3 contribute). Treat results as exploratory. For a statistically grounded ranking with uncertainty, use iqm_ranking() (requires multiple seeds).
- Parameters:
agg – Aggregation mode: "trimmed_mean" (default), "mean", or "median".
- Returns:
Result with .table and .plot().
- Return type:
- Raises:
ValueError – If agg is not a supported mode.
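The arithmetic behind the note on small benchmarks can be checked with a short sketch. It assumes the trim count is rounded down on each side, which reproduces the "5 datasets, 3 contribute" figure; evaluma's exact rounding rule is not stated here, and the helper name and scores are hypothetical.

```python
import numpy as np

def trimmed_mean(scores, trim=0.25):
    """Mean after dropping a fraction `trim` of values from each end.

    Returns the mean and the number of values that contributed to it.
    """
    ordered = np.sort(np.asarray(scores, dtype=float))
    cut = int(trim * ordered.size)  # floor rounding (an assumption)
    kept = ordered[cut:ordered.size - cut]
    return float(kept.mean()), int(kept.size)

# One model's scores on five hypothetical datasets.
mean_5, used_5 = trimmed_mean([0.10, 0.50, 0.60, 0.70, 0.99])
```

Here the lowest and highest scores are discarded and only the middle three datasets determine the result.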
- bayesian_comparison(rope=0.01, reference=None, pairs=None, random_state=None)#
Compute pairwise Bayesian comparisons via signed-rank test.
- Parameters:
rope – Half-width of the region of practical equivalence (ROPE) in normalized score space (0–1). Differences smaller than rope are treated as practically equivalent.
reference – If given, only compare each other model against this one.
pairs – Explicit list of (model_a, model_b) pairs to test. Overrides reference.
random_state – Seed for baycomp’s sampler.
- Returns:
Result with .table and .plot().
- Return type:
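To make the role of rope concrete, the sketch below sorts per-dataset score differences into the three regions that a ROPE defines. This is not the Bayesian signed-rank test itself, only the partition it reasons about; the function name, the inclusive boundary, and the example scores are assumptions.

```python
import numpy as np

def rope_partition(scores_a, scores_b, rope=0.01):
    """Count datasets where A is better, B is better, or the two are equivalent."""
    diff = np.asarray(scores_a, dtype=float) - np.asarray(scores_b, dtype=float)
    return {
        "a_better": int(np.sum(diff > rope)),
        "equivalent": int(np.sum(np.abs(diff) <= rope)),  # inside the ROPE
        "b_better": int(np.sum(diff < -rope)),
    }

# Hypothetical normalized scores of two models on three datasets.
counts = rope_partition([0.80, 0.705, 0.60], [0.70, 0.70, 0.65], rope=0.01)
```

A wider rope moves more datasets into the equivalent region, so the choice of half-width directly affects how often two models are reported as practically the same.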
- frequentist_comparison(reference=None, alpha=0.05)#
Compute frequentist model comparisons.
Runs a Friedman omnibus test first, then one of two post-hoc procedures. All-pairs mode follows the Demšar (2006) / autorank Friedman + Nemenyi workflow. Reference mode is an evaluma extension: pairwise Wilcoxon signed-rank tests against a named baseline with Holm correction.
- Parameters:
reference – If given, only compare each other model against this one using Wilcoxon + Holm. None triggers all-pairs Nemenyi mode.
alpha – Significance level for the significant column (default 0.05).
- Returns:
Result with .table and .plot().
- Return type:
- Raises:
ValueError – If fewer than 5 datasets are present.
References
Demšar, J. (2006). Statistical Comparisons of Classifiers over Multiple Data Sets. JMLR, 7, 1–30.
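The Holm step used in reference mode can be sketched independently of the tests that produce the p-values. The following NumPy helper applies the standard Holm step-down adjustment; the function name and the p-values are illustrative and are not taken from evaluma.

```python
import numpy as np

def holm_adjust(p_values):
    """Holm step-down adjusted p-values, returned in the input order."""
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)
    # The k-th smallest p-value (k = 0, 1, ...) is multiplied by (m - k).
    scaled = (m - np.arange(m)) * p[order]
    # Enforce monotonicity, then cap at 1.
    adjusted_sorted = np.minimum(1.0, np.maximum.accumulate(scaled))
    adjusted = np.empty(m)
    adjusted[order] = adjusted_sorted
    return adjusted

# Hypothetical raw p-values from three comparisons against a baseline.
adjusted = holm_adjust([0.01, 0.04, 0.03])
significant = adjusted < 0.05
```

In this example only the first comparison stays below alpha = 0.05 after correction, although all three raw p-values were below it.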
- performance_profiles()#
Compute Dolan-Moré performance profiles.
Profiles are computed on the raw (un-normalized) score matrix. All raw values must be strictly positive; see Raises.
- Returns:
Result with .table and .plot().
- Return type:
- Raises:
ValueError – If any raw score is zero or negative.
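For orientation, the classic Dolan-Moré construction can be written in a few lines of NumPy. The sketch below uses the original cost formulation, where lower values are better and each value is divided by the best value on that problem; how evaluma maps higher-is-better scores onto ratios is not stated here, so treat the direction handling and the function signature as assumptions.

```python
import numpy as np

def performance_profiles(costs, taus):
    """Fraction of problems each solver handles within a factor tau of the best.

    costs: array of shape (n_solvers, n_problems), strictly positive, lower is better.
    taus:  1-D array of ratio thresholds (each >= 1).
    Returns an array of shape (n_solvers, n_taus).
    """
    costs = np.asarray(costs, dtype=float)
    if np.any(costs <= 0):
        raise ValueError("all raw values must be strictly positive")
    # Performance ratio: cost relative to the best solver on each problem.
    ratios = costs / costs.min(axis=0, keepdims=True)
    taus = np.asarray(taus, dtype=float)
    return (ratios[:, :, None] <= taus[None, None, :]).mean(axis=1)

# Two hypothetical solvers (models) on three problems (datasets).
profiles = performance_profiles([[1.0, 2.0, 4.0], [2.0, 2.0, 1.0]], [1.0, 2.0, 4.0])
```

The ratios are undefined when a value is zero and change sign when it is negative, which is the reason for the strict positivity requirement.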