evaluma.benchmark#

Classes#

Benchmark

Container for a normalized model-vs-dataset score matrix.

Module Contents#

class evaluma.benchmark.Benchmark(raw_matrix: pandas.DataFrame, *, norm_ref_low=None, norm_ref_high=None, metric_direction=None, raw_runs=None)#

Container for a normalized model-vs-dataset score matrix.

After construction, the normalized scores are available as scores_. Use the analysis methods below to compute rankings, pairwise comparisons, and performance profiles.

_raw#
_norm_ref_low = None#
_norm_ref_high = None#
_metric_direction = None#
_raw_runs = None#
_normalize(matrix)#
property scores_#

Normalized model × dataset score matrix.

_new(raw_matrix, raw_runs=None)#
select_models(models)#

Subset the benchmark to the given models.

Parameters:

models – List of model names to retain.

Returns:

New benchmark containing only the selected models.

Return type:

Benchmark

select_datasets(datasets)#

Subset the benchmark to the given datasets.

Parameters:

datasets – List of dataset names to retain.

Returns:

New benchmark containing only the selected datasets.

Return type:

Benchmark

drop_incomplete()#

Remove models that have missing scores for any dataset.

Returns:

New benchmark with incomplete models removed.

Return type:

Benchmark

iqm_ranking(n_bootstrap=1000, random_state=None)#

Compute interquartile mean (IQM) rankings with stratified bootstrap confidence intervals.

Implements the IQM of Agarwal et al. (2021), as in rliable, on the flat run × dataset score array. Requires multiple seeds; for single-run data, use aggregate_ranking() instead.

Parameters:
  • n_bootstrap – Number of bootstrap samples for the 95 % CI.

  • random_state – Seed for the random number generator.

Returns:

Result with .table and .plot().

Return type:

IQMResult

Raises:

ValueError – If no seed data is available (_raw_runs is None).
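
To make the statistic concrete, here is a minimal standard-library sketch of an interquartile mean with a stratified percentile bootstrap. It is not evaluma's implementation: the helper names (iqm, stratified_bootstrap_ci), the dict-of-lists input, and the percentile rule for the interval are assumptions chosen for illustration, and the scores are made up.

```python
import random


def iqm(values):
    """Mean of the values that remain after trimming 25% from each end."""
    ordered = sorted(values)
    cut = int(0.25 * len(ordered))  # number of values dropped at each end
    middle = ordered[cut:len(ordered) - cut]
    return sum(middle) / len(middle)


def stratified_bootstrap_ci(runs_by_dataset, n_bootstrap=1000, random_state=None):
    """Percentile 95% interval for the IQM, resampling runs within each dataset."""
    rng = random.Random(random_state)
    estimates = []
    for _ in range(n_bootstrap):
        flat = []
        for runs in runs_by_dataset.values():
            # Stratification: each dataset keeps its own number of runs.
            flat.extend(rng.choices(runs, k=len(runs)))
        estimates.append(iqm(flat))
    estimates.sort()
    low = estimates[int(0.025 * (n_bootstrap - 1))]
    high = estimates[int(0.975 * (n_bootstrap - 1))]
    return low, high


# Hypothetical normalized scores for one model: 4 seeds on each of 3 datasets.
runs_by_dataset = {
    "ds_a": [0.70, 0.72, 0.68, 0.71],
    "ds_b": [0.40, 0.45, 0.43, 0.41],
    "ds_c": [0.90, 0.88, 0.91, 0.93],
}
flat_scores = [s for runs in runs_by_dataset.values() for s in runs]
point = iqm(flat_scores)
low, high = stratified_bootstrap_ci(runs_by_dataset, n_bootstrap=1000, random_state=0)
print(f"IQM = {point:.3f}, 95% CI = [{low:.3f}, {high:.3f}]")
```

Resampling within each dataset, rather than from the pooled array, keeps every dataset equally represented in each bootstrap replicate.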

aggregate_ranking(agg='trimmed_mean')#

Compute a point-estimate descriptive ranking (no CI).

Works on any benchmark regardless of whether seed data is present.

Parameters:

agg – Aggregation mode: "trimmed_mean" (default), "mean", or "median".

Returns:

Result with .table and .plot().

Return type:

AggregateResult

Raises:

ValueError – If agg is not a supported mode.
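
The three aggregation modes can rank the same models differently. The sketch below illustrates this in plain Python. The function rank_models, the nested-dict input, and the 25% trimming proportion are assumptions made for the example; the proportion evaluma actually trims is not stated here.

```python
import statistics


def trimmed_mean(values, proportion=0.25):
    """Mean after dropping `proportion` of the values from each end (assumed 25%)."""
    ordered = sorted(values)
    cut = int(proportion * len(ordered))
    kept = ordered[cut:len(ordered) - cut]
    return sum(kept) / len(kept)


AGGREGATORS = {
    "trimmed_mean": trimmed_mean,
    "mean": statistics.fmean,
    "median": statistics.median,
}


def rank_models(scores, agg="trimmed_mean"):
    """Return (model, aggregate) pairs sorted from best to worst."""
    if agg not in AGGREGATORS:
        raise ValueError(f"unsupported agg: {agg!r}")
    aggregate = AGGREGATORS[agg]
    summary = {
        model: aggregate(list(per_dataset.values()))
        for model, per_dataset in scores.items()
    }
    return sorted(summary.items(), key=lambda item: item[1], reverse=True)


# Hypothetical normalized scores; model_a has one very poor dataset.
scores = {
    "model_a": {"d1": 0.9, "d2": 0.8, "d3": 0.7, "d4": 0.1},
    "model_b": {"d1": 0.6, "d2": 0.65, "d3": 0.7, "d4": 0.6},
}
for mode in ("trimmed_mean", "mean", "median"):
    print(mode, rank_models(scores, agg=mode))
```

With these made-up numbers, the trimmed mean and the median place model_a first because its single outlier is discounted, while the plain mean places model_b first.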

bayesian_comparison(rope=0.01, reference=None, pairs=None, random_state=None)#

Compute pairwise Bayesian comparisons via a Bayesian signed-rank test.

Parameters:
  • rope – Half-width of the region of practical equivalence (ROPE); score differences within ±rope are treated as practically equivalent.

  • reference – If given, only compare each other model against this one.

  • pairs – Explicit list of (model_a, model_b) pairs to test. Overrides reference.

  • random_state – Seed for baycomp’s sampler.

Returns:

Result with .table and .plot().

Return type:

BayesianResult
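
For intuition about what one pairwise comparison returns, the following is a simplified standard-library sketch of a Bayesian signed-rank test for a single pair of models. It is neither baycomp's code nor evaluma's: the function name, the prior_strength default, the sample count, and the scores are all assumptions for illustration.

```python
import random


def bayesian_signed_rank(diffs, rope=0.01, prior_strength=0.5,
                         n_samples=2000, random_state=None):
    """Return (p_b_better, p_equivalent, p_a_better) for diffs = a - b."""
    rng = random.Random(random_state)
    # A pseudo-observation at 0 (inside the ROPE) carries the prior.
    z = [0.0] + list(diffs)
    alphas = [prior_strength] + [1.0] * len(diffs)
    n = len(z)
    # Classify every pairwise sum z_i + z_j against the 2 * rope threshold.
    right = [(i, j) for i in range(n) for j in range(n) if z[i] + z[j] > 2 * rope]
    left = [(i, j) for i in range(n) for j in range(n) if z[i] + z[j] < -2 * rope]
    wins = [0, 0, 0]
    for _ in range(n_samples):
        # Dirichlet-distributed weights, drawn as normalized gamma variates.
        gammas = [rng.gammavariate(a, 1.0) for a in alphas]
        total = sum(gammas)
        w = [g / total for g in gammas]
        theta_left = sum(w[i] * w[j] for i, j in left)
        theta_right = sum(w[i] * w[j] for i, j in right)
        theta_rope = 1.0 - theta_left - theta_right
        thetas = (theta_left, theta_rope, theta_right)
        # Count which of the three outcomes is most probable in this draw.
        wins[thetas.index(max(thetas))] += 1
    return tuple(count / n_samples for count in wins)


# Hypothetical normalized scores of two models on the same 8 datasets.
scores_a = [0.82, 0.75, 0.91, 0.66, 0.70, 0.88, 0.79, 0.73]
scores_b = [0.74, 0.70, 0.85, 0.60, 0.61, 0.80, 0.72, 0.69]
diffs = [a - b for a, b in zip(scores_a, scores_b)]
p_b_better, p_equivalent, p_a_better = bayesian_signed_rank(
    diffs, rope=0.01, random_state=0
)
print(p_b_better, p_equivalent, p_a_better)
```

The three returned values sum to 1. Because every difference in this example exceeds the ROPE, almost all of the posterior mass falls on "a is better".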

performance_profiles()#

Compute Dolan-Moré performance profiles.

Returns:

Result with .table and .plot().

Return type:

ProfileResult

Raises:

ValueError – If any raw score is zero or negative.
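
The ValueError follows from how performance ratios are built: each score is divided into the best score on its dataset, so every score must be strictly positive. The sketch below illustrates the Dolan-Moré construction under the assumption that higher scores are better (ratio = best score / model score). The function profile_curves and its nested-dict input are hypothetical, and evaluma may orient the ratio differently.

```python
def profile_curves(scores, taus):
    """For each model, the fraction of datasets with performance ratio <= tau."""
    models = list(scores)
    datasets = list(next(iter(scores.values())))
    if any(scores[m][d] <= 0 for m in models for d in datasets):
        raise ValueError("performance ratios require strictly positive scores")
    # Best (highest) score achieved by any model on each dataset.
    best = {d: max(scores[m][d] for m in models) for d in datasets}
    # Ratio of 1.0 means the model is the best on that dataset.
    ratios = {m: [best[d] / scores[m][d] for d in datasets] for m in models}
    return {
        m: [sum(r <= tau for r in ratios[m]) / len(datasets) for tau in taus]
        for m in models
    }


# Hypothetical raw scores (higher is better) for two models on three datasets.
scores = {
    "model_a": {"d1": 0.8, "d2": 0.5, "d3": 0.75},
    "model_b": {"d1": 0.4, "d2": 0.5, "d3": 0.5},
}
taus = [1.0, 1.5, 2.0]
profiles = profile_curves(scores, taus)
for model, curve in profiles.items():
    print(model, curve)
```

A curve that reaches 1.0 at a small tau belongs to a model that is never far from the best score on any dataset. Here model_a is best or tied on every dataset, so its curve is 1.0 throughout, while model_b reaches 1.0 only at tau = 2.0.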