evaluma.benchmark#
Classes#
Benchmark | Container for a normalized model-vs-dataset score matrix.
Module Contents#
- class evaluma.benchmark.Benchmark(raw_matrix: pandas.DataFrame, *, norm_ref_low=None, norm_ref_high=None, metric_direction=None, raw_runs=None)#
Container for a normalized model-vs-dataset score matrix.
After construction the normalized scores are available as scores_. Use the analysis methods to compute rankings, comparisons, and profiles.
- _raw#
- _norm_ref_low = None#
- _norm_ref_high = None#
- _metric_direction = None#
- _raw_runs = None#
- _normalize(matrix)#
- property scores_#
Normalized model × dataset score matrix.
- _new(raw_matrix, raw_runs=None)#
- select_models(models)#
Subset the benchmark to the given models.
- Parameters:
models – List of model names to retain.
- Returns:
New benchmark containing only the selected models.
- Return type:
- select_datasets(datasets)#
Subset the benchmark to the given datasets.
- Parameters:
datasets – List of dataset names to retain.
- Returns:
New benchmark containing only the selected datasets.
- Return type:
- drop_incomplete()#
Remove models that have missing scores for any dataset.
- Returns:
New benchmark with incomplete models removed.
- Return type:
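The three subsetting methods above correspond closely to ordinary pandas operations on the score matrix. The following is a minimal sketch, not the library's implementation. It assumes models are rows and datasets are columns (this page does not confirm the orientation), and the model and dataset names are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical score matrix: rows are models, columns are datasets (assumed layout).
raw = pd.DataFrame(
    {"ds_a": [0.81, 0.77, 0.90], "ds_b": [0.64, np.nan, 0.70]},
    index=["model_x", "model_y", "model_z"],
)

# Analogue of select_models: keep only the named rows.
by_model = raw.loc[["model_x", "model_z"]]

# Analogue of select_datasets: keep only the named columns.
by_dataset = raw[["ds_a"]]

# Analogue of drop_incomplete: drop every model with at least one missing score.
complete = raw.dropna(axis=0, how="any")

print(list(complete.index))
```

Here `model_y` is dropped because it has no score on `ds_b`. Each operation returns a new frame and leaves `raw` untouched, which matches the "New benchmark" wording in the return descriptions.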
- iqm_ranking(n_bootstrap=1000, random_state=None)#
Compute interquartile mean (IQM) rankings with stratified bootstrap confidence intervals.
Implements the IQM of Agarwal et al. 2021 (rliable) on the flat run×dataset score array. Requires multiple seeds; use aggregate_ranking() for single-run data.
- Parameters:
n_bootstrap – Number of bootstrap samples for the 95 % CI.
random_state – Seed for the random number generator.
- Returns:
Result with .table and .plot().
- Return type:
- Raises:
ValueError – If no seed data is available (_raw_runs is None).
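To make the IQM and its stratified bootstrap concrete, here is a minimal NumPy sketch for a single model. It is an illustration under stated assumptions, not the code behind iqm_ranking(): the helper names `iqm` and `iqm_with_ci`, the `(n_runs, n_datasets)` input layout, and the percentile interval are all assumptions. "Stratified" here means that runs are resampled with replacement independently within each dataset.

```python
import numpy as np

def iqm(scores):
    """Interquartile mean: mean of the middle 50% of the flattened scores."""
    flat = np.sort(np.asarray(scores, dtype=float).ravel())
    cut = int(0.25 * flat.size)  # number of values trimmed from each tail
    return flat[cut:flat.size - cut].mean()

def iqm_with_ci(runs, n_bootstrap=1000, random_state=None):
    """Point estimate and 95% percentile interval for one model.

    runs: array of shape (n_runs, n_datasets); hypothetical layout.
    """
    runs = np.asarray(runs, dtype=float)
    rng = np.random.default_rng(random_state)
    n_runs, n_datasets = runs.shape
    cols = np.arange(n_datasets)
    boot = np.empty(n_bootstrap)
    for b in range(n_bootstrap):
        # Stratified resampling: pick run indices independently for each dataset.
        idx = rng.integers(0, n_runs, size=(n_runs, n_datasets))
        boot[b] = iqm(runs[idx, cols])
    low, high = np.percentile(boot, [2.5, 97.5])
    return iqm(runs), low, high

# Four seeds (rows) on two datasets (columns); made-up scores.
runs = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]
point, low, high = iqm_with_ci(runs, n_bootstrap=500, random_state=0)
```

With eight scores in total, two are trimmed from each tail, so the point estimate is the mean of 0.3, 0.4, 0.5, and 0.6. With a single run per dataset, every resample is identical and the interval collapses to a point, which is why the method requires seed data.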
- aggregate_ranking(agg='trimmed_mean')#
Compute a point-estimate descriptive ranking (no CI).
Works on any benchmark regardless of whether seed data is present.
- Parameters:
agg – Aggregation mode: "trimmed_mean" (default), "mean", or "median".
- Returns:
Result with .table and .plot().
- Return type:
- Raises:
ValueError – If agg is not a supported mode.
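The three aggregation modes can rank the same models differently, as the sketch below shows. It is an illustration only. The `aggregate` helper and its `trim` parameter are hypothetical, and the trimming fraction used by the library is not stated on this page (25% per tail is assumed here).

```python
import numpy as np

def aggregate(scores, agg="trimmed_mean", trim=0.25):
    """Collapse one model's per-dataset scores to a single number.

    `trim` is a hypothetical parameter: the fraction cut from each tail.
    """
    x = np.sort(np.asarray(scores, dtype=float))
    if agg == "mean":
        return float(x.mean())
    if agg == "median":
        return float(np.median(x))
    if agg == "trimmed_mean":
        cut = int(trim * x.size)
        return float(x[cut:x.size - cut].mean())
    raise ValueError(f"unsupported agg: {agg!r}")

# Made-up normalized scores; model_x has one very poor dataset.
scores = {
    "model_x": [0.1, 0.8, 0.8, 0.9],
    "model_y": [0.6, 0.7, 0.7, 0.7],
}

def rank(agg):
    # Higher aggregate is assumed to be better.
    return sorted(scores, key=lambda m: aggregate(scores[m], agg), reverse=True)

by_trimmed = rank("trimmed_mean")
by_mean = rank("mean")
```

The trimmed mean discards the single outlier for `model_x` and ranks it first, while the plain mean ranks `model_y` first. Neither ranking comes with an uncertainty estimate, in line with the "no CI" note above.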
- bayesian_comparison(rope=0.01, reference=None, pairs=None, random_state=None)#
Compute pairwise Bayesian comparisons via the signed-rank test.
- Parameters:
rope – Region of practical equivalence half-width.
reference – If given, compare every other model against this one only.
pairs – Explicit list of (model_a, model_b) pairs to test. Overrides reference.
random_state – Seed for baycomp’s sampler.
- Returns:
Result with .table and .plot().
- Return type:
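For readers unfamiliar with the test, the sketch below outlines the idea behind a Bayesian signed-rank comparison with a region of practical equivalence. It is a simplified NumPy illustration and is neither baycomp's code nor this library's. The function name, the `prior` weight of 0.5 on a pseudo-observation at zero, and the "higher is better" convention are assumptions.

```python
import numpy as np

def signed_rank_probs(x, y, rope=0.01, prior=0.5, n_samples=5000, random_state=None):
    """Return (P(x better), P(equivalent), P(y better)) for paired scores.

    Higher scores are assumed to be better. `prior` is the weight of a
    pseudo-observation placed at zero difference.
    """
    rng = np.random.default_rng(random_state)
    diff = np.concatenate(([0.0], np.asarray(y, dtype=float) - np.asarray(x, dtype=float)))
    alpha = np.ones(diff.size)
    alpha[0] = prior
    # Pairwise sums z_i + z_j are compared against twice the ROPE half-width.
    sums = diff[:, None] + diff[None, :]
    below = (sums < -2 * rope).astype(float)
    above = (sums > 2 * rope).astype(float)
    # Each Dirichlet draw is one posterior sample of the observation weights.
    w = rng.dirichlet(alpha, size=n_samples)
    theta_left = np.einsum("si,ij,sj->s", w, below, w)
    theta_right = np.einsum("si,ij,sj->s", w, above, w)
    theta_rope = 1.0 - theta_left - theta_right
    # Count how often each region holds the largest posterior mass.
    stacked = np.stack([theta_left, theta_rope, theta_right], axis=1)
    counts = np.bincount(np.argmax(stacked, axis=1), minlength=3)
    return tuple(counts / n_samples)

# Made-up scores on ten datasets; model y is 0.1 better everywhere.
x = np.linspace(0.4, 0.6, 10)
y = x + 0.1
p_x, p_rope, p_y = signed_rank_probs(x, y, rope=0.01, random_state=0)
```

The three returned probabilities sum to one. A larger `rope` moves mass towards the "equivalent" outcome, so it should reflect the smallest difference that matters in practice.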
- performance_profiles()#
Compute Dolan-Moré performance profiles.
- Returns:
Result with .table and .plot().
- Return type:
- Raises:
ValueError – If any raw score is zero or negative.
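The positivity requirement follows from how Dolan-Moré profiles are built: each value is divided by the best value on the same dataset. The sketch below shows the classic formulation for strictly positive, lower-is-better costs. It is illustrative only. The helper name and input layout are assumptions, and how the library handles higher-is-better metrics is not described on this page.

```python
import numpy as np

def performance_profile(costs, taus):
    """Fraction of datasets on which each model is within a factor tau of the best.

    costs: (n_models, n_datasets) array of strictly positive, lower-is-better values.
    Returns an array of shape (n_models, len(taus)).
    """
    costs = np.asarray(costs, dtype=float)
    if np.any(costs <= 0):
        raise ValueError("performance profiles need strictly positive values")
    # Ratio to the best model per dataset; 1.0 marks the best model.
    ratios = costs / costs.min(axis=0, keepdims=True)
    taus = np.asarray(taus, dtype=float)
    return (ratios[:, :, None] <= taus[None, None, :]).mean(axis=1)

# Two models on three datasets; made-up costs.
costs = [[1.0, 2.0, 4.0],
         [2.0, 2.0, 1.0]]
rho = performance_profile(costs, taus=[1.0, 2.0, 4.0])
```

Each row of `rho` is non-decreasing in tau. The value at tau = 1 is the share of datasets on which that model is best or tied for best, and a zero or negative entry would make the ratios meaningless, hence the ValueError.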