evaluma.benchmark

Classes#

`Benchmark`	Container for a normalized model-vs-dataset score matrix.
`BenchmarkGroup`	Collection of condition-keyed Benchmark objects.

Module Contents#

class evaluma.benchmark.Benchmark(raw_matrix: pandas.DataFrame, *, norm_ref_low=None, norm_ref_high=None, metric_direction=None, raw_runs=None, dataset_metric_map=None)#

Container for a normalized model-vs-dataset score matrix.

After construction the normalized scores are available as scores_. Use the analysis methods to compute rankings, comparisons, and profiles.

_raw#

_norm_ref_low = None#

_norm_ref_high = None#

_metric_direction = None#

_raw_runs = None#

_dataset_metric_map = None#

_normalize(matrix)#

property scores_#: Normalized model × dataset score matrix.

_new(raw_matrix, raw_runs=None)#

Build a subset Benchmark with normalization bounds frozen from the parent.

Subsetting filters cells without re-scaling the retained scores. The parent’s bounds are resolved to concrete per-dataset Series (computed on the pre-inversion raw matrix) and restricted to the surviving columns, so a survivor’s normalized score is identical whether or not its peers were dropped. _normalize still applies any metric_direction inversion, once, using these frozen bounds.
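A minimal sketch of why frozen bounds matter, assuming plain min-max scaling between a per-dataset low and high bound (the scheme actually used by _normalize may differ). The helpers bounds_from and normalize are hypothetical illustrations and do not use the evaluma API.

```python
def bounds_from(raw):
    """Per-dataset (low, high) bounds from a model -> {dataset: score} mapping."""
    datasets = next(iter(raw.values())).keys()
    return {
        d: (min(row[d] for row in raw.values()), max(row[d] for row in raw.values()))
        for d in datasets
    }


def normalize(raw, bounds):
    """Min-max scale every cell with the supplied bounds (assumed scheme)."""
    return {
        model: {
            d: (score - bounds[d][0]) / (bounds[d][1] - bounds[d][0])
            for d, score in row.items()
        }
        for model, row in raw.items()
    }


parent = {
    "model_a": {"ds1": 0.60, "ds2": 0.70},
    "model_b": {"ds1": 0.80, "ds2": 0.90},
    "model_c": {"ds1": 1.00, "ds2": 0.50},
}
frozen = bounds_from(parent)  # resolved once, on the full parent matrix
full = normalize(parent, frozen)

# Drop model_c, then normalize the survivors in two different ways.
subset = {m: row for m, row in parent.items() if m != "model_c"}
with_frozen = normalize(subset, frozen)              # bounds inherited from the parent
with_refit = normalize(subset, bounds_from(subset))  # bounds recomputed on the subset

print(full["model_b"], with_frozen["model_b"], with_refit["model_b"])
```

With frozen bounds the scores of model_b match the parent exactly; recomputing the bounds on the subset changes them even though the raw scores did not move.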

property models_#: Model names in row order.

property datasets_#: Dataset names in column order.

select_models(models)#

Subset the benchmark to the given models.

Subsetting filters cells without re-scaling retained scores; the normalization bounds are frozen from the parent.

Parameters: models – List of model names to retain.
Returns: New benchmark containing only the selected models.
Return type: Benchmark

drop_models(exclude)#

Subset the benchmark by dropping specific models.

Subsetting filters cells without re-scaling retained scores; the normalization bounds are frozen from the parent.

Parameters: exclude – List of model names to remove.
Returns: New benchmark without the excluded models.
Return type: Benchmark

select_datasets(datasets)#

Subset the benchmark to the given datasets.

Subsetting filters cells without re-scaling retained scores; the normalization bounds are frozen from the parent.

Parameters: datasets – List of dataset names to retain.
Returns: New benchmark containing only the selected datasets.
Return type: Benchmark

drop_datasets(exclude)#

Subset the benchmark by dropping specific datasets.

Subsetting filters cells without re-scaling retained scores; the normalization bounds are frozen from the parent.

Parameters: exclude – List of dataset names to remove.
Returns: New benchmark without the excluded datasets.
Return type: Benchmark

drop_incomplete()#

Remove models that have missing scores for any dataset.

Returns: New benchmark with incomplete models removed.
Return type: Benchmark

iqm_ranking(n_bootstrap=1000, random_state=None)#

Compute IQM rankings with stratified bootstrap confidence intervals.

Implements the Agarwal et al. 2021 (rliable) IQM on the flat run×dataset score array. Requires multiple seeds; use aggregate_ranking() for single-run data.

Parameters:

n_bootstrap – Number of bootstrap samples for the 95 % CI.
random_state – Seed for the random number generator.

Returns:

Result with .table and .plot().

Return type:

IQMResult

Raises:

ValueError – If no seed data is available (_raw_runs is None).
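The following is a minimal sketch of an interquartile mean with a stratified bootstrap, written for a single model. It assumes the IQM drops the lowest and highest 25% of the flat run×dataset scores and that strata are datasets, with runs resampled within each dataset; the function names and the percentile-based interval are illustrative assumptions, not the evaluma implementation.

```python
import random


def iqm(values):
    """Interquartile mean: drop the lowest and highest 25%, average the rest."""
    ordered = sorted(values)
    cut = int(0.25 * len(ordered))
    middle = ordered[cut:len(ordered) - cut]
    return sum(middle) / len(middle)


def iqm_with_ci(runs_by_dataset, n_bootstrap=1000, random_state=None):
    """Point IQM plus a 95% percentile interval from a stratified bootstrap.

    runs_by_dataset maps dataset -> list of per-seed normalized scores for one
    model. Runs are resampled with replacement within each dataset (stratum).
    """
    rng = random.Random(random_state)
    flat = [s for runs in runs_by_dataset.values() for s in runs]
    point = iqm(flat)
    estimates = []
    for _ in range(n_bootstrap):
        resampled = []
        for runs in runs_by_dataset.values():
            resampled.extend(rng.choices(runs, k=len(runs)))
        estimates.append(iqm(resampled))
    estimates.sort()
    lower = estimates[int(0.025 * (n_bootstrap - 1))]
    upper = estimates[int(0.975 * (n_bootstrap - 1))]
    return point, (lower, upper)


runs = {"ds1": [0.2, 0.4, 0.6, 0.8], "ds2": [0.1, 0.3, 0.5, 0.7]}
point, (lower, upper) = iqm_with_ci(runs, n_bootstrap=500, random_state=0)
print(point, lower, upper)
```

Resampling needs more than one run per dataset to produce any spread, which is consistent with the requirement for multiple seeds.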

aggregate_ranking(agg='trimmed_mean')#

Compute a point-estimate descriptive ranking (no CI).

Works on any benchmark regardless of whether seed data is present.

Note

This is a descriptive point estimate only (no CI). The trimmed-mean variant trims across datasets, not across seeds; with fewer than ~10 datasets the 25% trim is aggressive (e.g. 5 datasets → only 3 contribute). Treat results as exploratory. For a statistically grounded ranking with uncertainty, use iqm_ranking() (requires multiple seeds).

Parameters: agg – Aggregation mode: "trimmed_mean" (default), "mean", or "median".
Returns: Result with .table and .plot().
Return type: AggregateResult
Raises: ValueError – If agg is not a supported mode.
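To make the note about trimming concrete, here is a small sketch of the three aggregation modes applied across datasets. It assumes a symmetric 25% trim on each side; the names trimmed_mean, AGGREGATORS, and aggregate_rank are hypothetical and only mirror the documented agg options.

```python
from statistics import mean, median


def trimmed_mean(values, trim=0.25):
    """Mean after dropping the lowest and highest `trim` fraction of values."""
    ordered = sorted(values)
    cut = int(trim * len(ordered))
    return mean(ordered[cut:len(ordered) - cut])


AGGREGATORS = {"trimmed_mean": trimmed_mean, "mean": mean, "median": median}


def aggregate_rank(scores, agg="trimmed_mean"):
    """Order models from best to worst by an aggregate over datasets."""
    if agg not in AGGREGATORS:
        raise ValueError(f"unsupported agg: {agg!r}")
    summary = {model: AGGREGATORS[agg](row) for model, row in scores.items()}
    order = sorted(summary, key=summary.get, reverse=True)
    return order, summary


# Five datasets per model: with a 25% trim only the middle three contribute.
scores = {
    "steady": [0.50, 0.55, 0.60, 0.65, 0.70],
    "spiky": [0.00, 0.62, 0.64, 0.66, 1.00],
}
order_trimmed, summary_trimmed = aggregate_rank(scores, "trimmed_mean")
order_mean, summary_mean = aggregate_rank(scores, "mean")
print(order_trimmed, order_mean)
```

In this toy data the two modes disagree on the order, because the trim discards the single very low score that pulls down the plain mean of the second model.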

improvability_ranking()#

Rank models by mean improvability (distance from the per-dataset best).

For each model, reports the average percent error reduction needed to match the best method on each dataset, faithful to the TabArena / BeyondArena definition. Error is reconstructed in raw score space from each dataset’s metric direction and theoretical optimum (from the metric registry) — never from the normalized scores_ matrix. Lower is better; the per-dataset best method scores 0.

Optima are resolved lazily here (not at load time), so benchmarks whose metrics have no defined optimum still load and serve other methods.

Returns:

Result with .table, .per_dataset, and .plot().

Return type:

ImprovabilityResult

Raises:

ValueError – If a dataset’s metric is not in the registry and is not explicitly overridden as "min" in metric_direction, or if its error optimum cannot be resolved.
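A minimal sketch of the improvability calculation, assuming error is the absolute distance between a raw score and the metric’s theoretical optimum, and that improvability on a dataset is 100 × (err − best_err) / err. The helpers to_error and improvability are hypothetical, and a zero error is mapped to zero improvability by assumption.

```python
def to_error(score, optimum):
    """Reconstruct an error from a raw score and the metric's optimum."""
    return abs(optimum - score)


def improvability(errors):
    """Percent error reduction each model needs to match the per-dataset best.

    errors maps model -> {dataset: error}, where lower is better.
    """
    datasets = next(iter(errors.values())).keys()
    best = {d: min(row[d] for row in errors.values()) for d in datasets}
    per_dataset = {
        model: {
            d: 0.0 if err == 0 else 100.0 * (err - best[d]) / err
            for d, err in row.items()
        }
        for model, row in errors.items()
    }
    mean_improvability = {
        model: sum(vals.values()) / len(vals) for model, vals in per_dataset.items()
    }
    return per_dataset, mean_improvability


# Raw accuracies (optimum 1.0), converted to errors in raw score space.
accuracy = {
    "model_a": {"ds1": 0.90, "ds2": 0.80},
    "model_b": {"ds1": 0.80, "ds2": 0.80},
}
errors = {
    model: {d: to_error(s, optimum=1.0) for d, s in row.items()}
    for model, row in accuracy.items()
}
per_dataset, mean_improvability = improvability(errors)
print(mean_improvability)
```

Here model_b would need to halve its error on ds1 to match model_a and is already tied on ds2, so its mean improvability is about 25, while model_a scores 0.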

bayesian_comparison(rope=0.01, reference=None, pairs=None, random_state=None)#

Compute pairwise Bayesian comparisons via signed-rank test.

Parameters:

rope – Region of practical equivalence half-width in normalized score space (0–1). Differences smaller than rope are treated as practically equivalent.
reference – If given, only compare each other model against this one.
pairs – Explicit list of (model_a, model_b) pairs to test. Overrides reference.
random_state – Seed for baycomp’s sampler.

Returns:

Result with .table and .plot().

Return type:

BayesianResult
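The sketch below only illustrates how a ROPE half-width partitions per-dataset score differences into three regions. It is a plain frequency count, not the Bayesian signed-rank test and not a posterior; the function name rope_breakdown and the treatment of a difference exactly equal to rope are assumptions.

```python
def rope_breakdown(scores_a, scores_b, rope=0.01):
    """Fraction of datasets where A is better, B is better, or the two are
    within the region of practical equivalence (|difference| <= rope)."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    return {
        "a_better": sum(d > rope for d in diffs) / n,
        "equivalent": sum(abs(d) <= rope for d in diffs) / n,
        "b_better": sum(d < -rope for d in diffs) / n,
    }


# Normalized scores of two models on four datasets.
model_a = [0.80, 0.705, 0.60, 0.50]
model_b = [0.70, 0.700, 0.65, 0.50]
breakdown = rope_breakdown(model_a, model_b, rope=0.01)
print(breakdown)
```

Widening rope moves mass into the "equivalent" region, which is why the half-width should reflect a difference that is negligible on the normalized 0–1 scale.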

frequentist_comparison(reference=None, alpha=0.05)#

Compute frequentist model comparisons.

Runs a Friedman omnibus test first, then one of two post-hoc procedures. All-pairs mode follows the Demšar (2006) / autorank Friedman + Nemenyi workflow. Reference mode is an evaluma extension: pairwise Wilcoxon signed-rank tests against a named baseline with Holm correction.

Parameters:

reference – If given, only compare each other model against this one using Wilcoxon + Holm. None triggers all-pairs Nemenyi mode.
alpha – Significance level for the significant column (default 0.05).

Returns:

Result with .table and .plot().

Return type:

FrequentistResult

Raises:

ValueError – If fewer than 5 datasets are present.

References

Demšar, J. (2006). Statistical Comparisons of Classifiers over Multiple Data Sets. JMLR, 7, 1–30.
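For the reference mode, the Holm step-down correction can be sketched as follows. The function holm_adjust and the example p-values are illustrative and are not taken from evaluma; the p-values stand in for the Wilcoxon results of three models compared against a baseline.

```python
def holm_adjust(p_values):
    """Holm step-down adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The smallest p-value is multiplied by m, the next by m - 1, and so on.
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Enforce monotonicity so adjusted values never decrease with rank.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


raw_p = [0.01, 0.04, 0.03]  # hypothetical p-values against one reference model
adjusted_p = holm_adjust(raw_p)
significant = [p < 0.05 for p in adjusted_p]
print(adjusted_p, significant)
```

Only the first comparison stays below alpha = 0.05 after correction, although all three raw p-values were below it.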

elo_ranking(n_bootstrap=1000, random_state=None, tie_threshold=None, calibration_model=None)#

Compute MLE ELO rankings with battle-within-task bootstrap CIs.

Derives a scalar ELO rating per model from pairwise win/loss battles across datasets. Each dataset contributes equally via sample weighting. Complements aggregate_ranking() and iqm_ranking() with a pairwise-derived ranking.

Parameters:

n_bootstrap – Number of bootstrap replicates for 95% CI. Set to 0 to skip bootstrap (CI columns will be NaN).
random_state – Seed for the random number generator.
tie_threshold – Minimum score difference (on the [0,1] normalized scale) to emit a battle. Pairs within the threshold are skipped. None means strict inequality.
calibration_model – If given, shift all ratings so this model has ELO = 1000.

Returns:

Result with .table, .winrate_matrix, .plot(), and .plot_winrate().

Return type:

EloResult

Raises:

ValueError – If calibration_model is not in the score matrix.
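A minimal sketch of a maximum-likelihood ELO fit, assuming a Bradley-Terry model solved with the classic minorization-maximization update and mapped to the ELO scale with 400 · log10. The bootstrap is omitted, the function names are hypothetical, centering on 1000 when no calibration model is given is an assumption, and every model must have at least one win and one loss for the fit to be finite.

```python
import math


def battles_from_scores(scores, tie_threshold=None):
    """Turn dataset -> {model: score} into (winner, loser, weight) battles.

    Each dataset's battles share a total weight of 1 so datasets count equally.
    """
    battles = []
    for per_model in scores.values():
        models = sorted(per_model)
        found = []
        for i, a in enumerate(models):
            for b in models[i + 1:]:
                diff = per_model[a] - per_model[b]
                if diff == 0:
                    continue  # strict inequality: exact ties emit no battle
                if tie_threshold is not None and abs(diff) < tie_threshold:
                    continue  # within the threshold: skipped
                found.append((a, b) if diff > 0 else (b, a))
        battles.extend((w, l, 1.0 / len(found)) for w, l in found)
    return battles


def fit_elo(battles, calibration_model=None, n_iter=200):
    """Bradley-Terry MLE via MM updates, reported on an ELO-like scale."""
    models = sorted({m for w, l, _ in battles for m in (w, l)})
    if calibration_model is not None and calibration_model not in models:
        raise ValueError(f"{calibration_model!r} is not in the score matrix")
    wins = {m: 0.0 for m in models}
    for winner, _, weight in battles:
        wins[winner] += weight
    strength = {m: 1.0 for m in models}
    for _ in range(n_iter):
        updated = {}
        for m in models:
            denom = sum(
                weight / (strength[w] + strength[l])
                for w, l, weight in battles
                if m in (w, l)
            )
            updated[m] = wins[m] / denom
        # Rescale so the geometric mean of the strengths stays at 1.
        log_mean = sum(math.log(s) for s in updated.values()) / len(updated)
        strength = {m: s / math.exp(log_mean) for m, s in updated.items()}
    ratings = {m: 400.0 * math.log10(s) for m, s in strength.items()}
    shift = 1000.0 - (ratings[calibration_model] if calibration_model else 0.0)
    return {m: r + shift for m, r in ratings.items()}


scores = {
    "ds1": {"a": 0.90, "b": 0.70, "c": 0.50},
    "ds2": {"a": 0.60, "b": 0.80, "c": 0.40},
    "ds3": {"a": 0.70, "b": 0.50, "c": 0.75},
}
ratings = fit_elo(battles_from_scores(scores), calibration_model="b")
print(ratings)
```

Because the calibration only shifts all ratings by a constant, rating differences between models are unchanged by the choice of calibration_model.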

performance_profiles()#

Compute Dolan-Moré performance profiles.

Profiles are computed on the raw (un-normalized) score matrix. All raw values must be strictly positive; see Raises.

Returns: Result with .table and .plot().
Return type: ProfileResult
Raises: ValueError – If any raw score is zero or negative.
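The sketch below shows the classic Dolan-Moré construction on a cost matrix where lower is better: each value is divided by the best value on that problem, and the profile at τ is the fraction of problems whose ratio is at most τ. How evaluma orients higher-is-better scores is not stated here, so the cost orientation and the function name performance_profile are assumptions.

```python
def performance_profile(costs, taus):
    """Dolan-Moré profile: solver -> fraction of problems with ratio <= tau.

    costs maps solver -> list of strictly positive costs, one per problem.
    """
    solvers = list(costs)
    n_problems = len(costs[solvers[0]])
    if any(c <= 0 for row in costs.values() for c in row):
        raise ValueError("all values must be strictly positive")
    best = [min(costs[s][p] for s in solvers) for p in range(n_problems)]
    ratios = {s: [costs[s][p] / best[p] for p in range(n_problems)] for s in solvers}
    return {
        s: [sum(r <= tau for r in ratios[s]) / n_problems for tau in taus]
        for s in solvers
    }


costs = {"solver_a": [1.0, 2.0, 4.0], "solver_b": [2.0, 2.0, 3.0]}
taus = [1.0, 1.5, 2.0]
profile = performance_profile(costs, taus)
print(profile)
```

The ratio to the per-problem best is only meaningful for strictly positive values, which is why a zero or negative entry is rejected.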

_validate_callable_rank_vector(ranks: pandas.Series) → pandas.Series#

Validate and align a custom ranker’s output.

The callable-ranker contract is intentionally narrow: it must return a numeric pd.Series indexed by this benchmark’s models, where lower values are better and 1 denotes the best rank.

Parameters:

ranks – Candidate rank vector returned by a custom ranker.

Returns:

Float rank vector aligned to self.models_ order.

Return type:

pd.Series

Raises:

TypeError – If ranks is not a numeric pd.Series.
ValueError – If the series index does not match the benchmark’s model set, or if any rank value is missing / non-finite.

_rank_vector(ranker)#

Per-model rank Series (rank 1 = best) under a named or custom ranker.

Parameters:

ranker – "avg_rank" (mean of per-dataset ranks), "elo" (MLE ELO rating), "improvability" (mean error reduction to the per-dataset best), or a callable bench -> pd.Series returning literal per-model ranks indexed by model.

Returns:

Per-model ranks (rank 1 = best) indexed by model.

Return type:

pd.Series

Raises:

TypeError – If a callable ranker does not return a numeric pd.Series indexed by model.
ValueError – If ranker is an unknown name.

rank_sensitivity(other, cond_a, cond_b, n_bootstrap=1000, random_state=None, agg='trimmed_mean', ranker='aggregate')#

Quantify whether rankings reorder between two conditions.

Parameters:

other – Benchmark for condition B.
cond_a – Label for this benchmark’s condition.
cond_b – Label for the other benchmark’s condition.
n_bootstrap – Number of dataset-bootstrap replicates for 95% CI. Only used when ranker="aggregate".
random_state – Seed for bootstrap sampling.
agg – Per-model aggregation defining the ranking when ranker="aggregate". Defaults to "trimmed_mean" to match aggregate_ranking(); "mean" is available for light-tailed or very-small-N data, and "median" is also accepted.
ranker – Ranking method that produces the two orderings compared by tau. "aggregate" (default) applies agg to the normalized score matrix and reports a bootstrap CI (the original behavior). "avg_rank", "elo", "improvability", or a callable bench -> pd.Series decouple the ranking from the aggregate family and return a point estimate only (tau_ci=(nan, nan)). Custom callables must return literal numeric ranks indexed by model, with lower values better and 1 meaning best.

Returns:

Rank sensitivity result object.

Return type:

RankSensitivityResult

Raises:

TypeError – If other is not a Benchmark.
ValueError – If model or dataset sets differ across benchmarks.
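As an illustration of the point estimate, the sketch below compares two rank vectors with Kendall’s tau-a. That tau here means Kendall’s tau is an assumption, as is the function name kendall_tau; the bootstrap CI is omitted.

```python
def kendall_tau(ranks_a, ranks_b):
    """Kendall's tau-a between two model -> rank mappings (rank 1 = best)."""
    if set(ranks_a) != set(ranks_b):
        raise ValueError("both conditions must rank the same models")
    models = sorted(ranks_a)
    concordant = discordant = 0
    for i, m in enumerate(models):
        for n in models[i + 1:]:
            sign = (ranks_a[m] - ranks_a[n]) * (ranks_b[m] - ranks_b[n])
            if sign > 0:
                concordant += 1
            elif sign < 0:
                discordant += 1
    n_pairs = len(models) * (len(models) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical rankings of four models under two conditions.
ranks_cond_a = {"m1": 1, "m2": 2, "m3": 3, "m4": 4}
ranks_cond_b = {"m1": 1, "m2": 3, "m3": 2, "m4": 4}
tau = kendall_tau(ranks_cond_a, ranks_cond_b)
print(tau)
```

A value of 1 means the two conditions order the models identically, and -1 means the order is fully reversed; here one swapped pair out of six gives roughly 0.67.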

class evaluma.benchmark.BenchmarkGroup(benchmarks)#

Collection of condition-keyed Benchmark objects.

_benchmarks#

__getitem__(key)#

Return the benchmark for one condition label.

Parameters: key – Condition label key.
Returns: Benchmark associated with key.
Return type: Benchmark
Raises: KeyError – If key is not present.

rank_sensitivity(cond_a, cond_b, n_bootstrap=1000, random_state=None, agg='trimmed_mean', ranker='aggregate')#

Run rank-sensitivity analysis between two conditions in the group.

Parameters:

cond_a – Condition A label.
cond_b – Condition B label.
n_bootstrap – Number of dataset-bootstrap replicates for 95% CI. Only used when ranker="aggregate".
random_state – Seed for bootstrap sampling.
agg – Per-model aggregation defining the ranking when ranker="aggregate". Defaults to "trimmed_mean" to match Benchmark.aggregate_ranking(); "mean" is available for light-tailed or very-small-N data.
ranker – Same ranking selector supported by Benchmark.rank_sensitivity(). "aggregate" preserves the original grouped behavior; alternate rankers return point estimates with tau_ci=(nan, nan).

Returns:

Rank sensitivity result object.

Return type:

RankSensitivityResult

Raises:

KeyError – If either condition label is missing.
ValueError – If the two benchmarks have mismatched model/dataset sets.

select_models(models)#

Select the same model subset across all conditions.

Parameters: models – List of model names to retain.
Returns: New group with selected models in each benchmark.
Return type: BenchmarkGroup

drop_models(exclude)#

Drop the same model subset across all conditions.

Parameters: exclude – List of model names to remove.
Returns: New group with excluded models removed.
Return type: BenchmarkGroup

select_datasets(datasets)#

Select the same dataset subset across all conditions.

Parameters: datasets – List of dataset names to retain.
Returns: New group with selected datasets in each benchmark.
Return type: BenchmarkGroup

drop_datasets(exclude)#

Drop the same dataset subset across all conditions.

Parameters: exclude – List of dataset names to remove.
Returns: New group with excluded datasets removed.
Return type: BenchmarkGroup