evaluma#

Attributes#

__version__

Classes#

`Benchmark`	Container for a normalized model-vs-dataset score matrix.
`FrequentistResult`	Result of `frequentist_comparison()`.

Functions#

`load_df`(df, *[, model, dataset, metric, score, seed, ...])	Load a DataFrame and return a ready-to-use Benchmark object.
`load_csv`(path, *[, model, dataset, metric, score, ...])	Load a benchmark CSV file and return a ready-to-use Benchmark object.
`_resolve_metric_type_bounds`(metric_type_bounds, ...)	Resolve per-dataset normalization bounds and directions from the metric registry.

Package Contents#

evaluma.__version__: str#

class evaluma.Benchmark(raw_matrix: pandas.DataFrame, *, norm_ref_low=None, norm_ref_high=None, metric_direction=None, raw_runs=None)#

Container for a normalized model-vs-dataset score matrix.

After construction, the normalized scores are available as scores_. Use the analysis methods to compute rankings, comparisons, and profiles.

_raw#

_norm_ref_low = None#

_norm_ref_high = None#

_metric_direction = None#

_raw_runs = None#

_normalize(matrix)#

property scores_#

Normalized model × dataset score matrix.

_new(raw_matrix, raw_runs=None)#

select_models(models)#

Subset the benchmark to the given models.

Parameters: models – List of model names to retain.
Returns: New benchmark containing only the selected models.
Return type: Benchmark

select_datasets(datasets)#

Subset the benchmark to the given datasets.

Parameters: datasets – List of dataset names to retain.
Returns: New benchmark containing only the selected datasets.
Return type: Benchmark

drop_incomplete()#

Remove models that have missing scores for any dataset.

Returns: New benchmark with incomplete models removed.
Return type: Benchmark

iqm_ranking(n_bootstrap=1000, random_state=None)#

Compute IQM rankings with stratified bootstrap confidence intervals.

Implements the interquartile mean (IQM) of Agarwal et al. (2021), as in rliable, on the flattened run × dataset score array. Requires multiple seeds; for single-run data, use aggregate_ranking() instead.

Parameters:

n_bootstrap – Number of bootstrap samples for the 95% CI.
random_state – Seed for the random number generator.

Returns:

Result with .table and .plot().

Return type:

IQMResult

Raises:

ValueError – If no seed data is available (_raw_runs is None).
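The IQM and its stratified bootstrap interval can be sketched in plain Python. This is a minimal illustration under stated assumptions, not evaluma's implementation: the helper names `iqm` and `stratified_bootstrap_ci`, the per-dataset dictionary layout, the floor-based trim count, and the percentile interval are all choices made for the example.

```python
import random
from statistics import mean


def iqm(values):
    """Interquartile mean: drop the lowest and highest 25% and average the rest."""
    ordered = sorted(values)
    cut = int(0.25 * len(ordered))  # values trimmed from each tail (assumed floor rounding)
    return mean(ordered[cut:len(ordered) - cut])


def stratified_bootstrap_ci(runs_by_dataset, n_bootstrap=1000, random_state=None):
    """Percentile 95% CI; runs are resampled with replacement within each dataset."""
    rng = random.Random(random_state)
    estimates = []
    for _ in range(n_bootstrap):
        flat = []
        for runs in runs_by_dataset.values():
            flat.extend(rng.choices(runs, k=len(runs)))  # one stratum per dataset
        estimates.append(iqm(flat))
    estimates.sort()
    lo = estimates[int(0.025 * (n_bootstrap - 1))]
    hi = estimates[int(0.975 * (n_bootstrap - 1))]
    return lo, hi


# Hypothetical normalized scores for one model: 3 datasets x 4 seeds.
runs = {
    "d1": [0.62, 0.65, 0.60, 0.64],
    "d2": [0.80, 0.78, 0.83, 0.79],
    "d3": [0.41, 0.45, 0.39, 0.44],
}
flat_scores = [s for r in runs.values() for s in r]
point = iqm(flat_scores)
low, high = stratified_bootstrap_ci(runs, n_bootstrap=500, random_state=0)
print(f"IQM = {point:.3f}, 95% CI = [{low:.3f}, {high:.3f}]")
```

Passing the same random_state twice reproduces the same interval, which is the role the parameter of that name plays in the signature above.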

aggregate_ranking(agg='trimmed_mean')#

Compute a point-estimate descriptive ranking (no CI).

Works on any benchmark regardless of whether seed data is present.

Note

This is a descriptive point estimate only. The trimmed-mean variant trims across datasets, not across seeds; with fewer than ~10 datasets, the 25% trim is aggressive (e.g. with 5 datasets, only the middle 3 contribute). Treat results as exploratory. For a statistically grounded ranking with uncertainty, use iqm_ranking() (requires multiple seeds).

Parameters: agg – Aggregation mode: "trimmed_mean" (default), "mean", or "median".
Returns: Result with .table and .plot().
Return type: AggregateResult
Raises: ValueError – If agg is not a supported mode.
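To make the note about aggressive trimming concrete, the sketch below applies the three aggregation modes to one model's scores across five datasets. The function `aggregate` is a hypothetical stand-in written for this example; how evaluma rounds the trim count is an assumption (floor is used here, which matches the "5 datasets, 3 contribute" case in the note).

```python
from statistics import mean, median


def aggregate(scores, agg="trimmed_mean", trim=0.25):
    """Aggregate one model's per-dataset scores into a single number."""
    ordered = sorted(scores)
    if agg == "trimmed_mean":
        cut = int(trim * len(ordered))  # datasets dropped from each tail
        return mean(ordered[cut:len(ordered) - cut])
    if agg == "mean":
        return mean(ordered)
    if agg == "median":
        return median(ordered)
    raise ValueError(f"unsupported agg: {agg!r}")


# Hypothetical normalized scores of one model on five datasets.
five = [0.10, 0.50, 0.55, 0.60, 0.95]
contributing = len(five) - 2 * int(0.25 * len(five))

print("datasets contributing to the trimmed mean:", contributing)
for mode in ("trimmed_mean", "mean", "median"):
    print(mode, round(aggregate(five, mode), 3))
```

With five datasets the lowest and highest scores are discarded, so a single unusually good or bad dataset has no influence on the trimmed mean at all.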

bayesian_comparison(rope=0.01, reference=None, pairs=None, random_state=None)#

Compute pairwise Bayesian comparisons using a signed-rank test.

Parameters:

rope – Region of practical equivalence half-width in normalized score space (0–1). Differences smaller than rope are treated as practically equivalent.
reference – If given, compare every other model against this model only.
pairs – Explicit list of (model_a, model_b) pairs to test. Overrides reference.
random_state – Seed for baycomp’s sampler.

Returns:

Result with .table and .plot().

Return type:

BayesianResult
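The role of rope can be illustrated without the Bayesian machinery. The sketch below only sorts per-dataset score differences into "a better", "practically equivalent", and "b better"; it is not the signed-rank test itself, and `rope_counts` is a hypothetical helper introduced for this example.

```python
def rope_counts(scores_a, scores_b, rope=0.01):
    """Count datasets where model a wins, model b wins, or the gap is inside the ROPE."""
    counts = {"a_better": 0, "equivalent": 0, "b_better": 0}
    for a, b in zip(scores_a, scores_b):
        diff = a - b
        if abs(diff) < rope:      # smaller than the ROPE half-width
            counts["equivalent"] += 1
        elif diff > 0:
            counts["a_better"] += 1
        else:
            counts["b_better"] += 1
    return counts


# Hypothetical normalized scores of two models on four datasets.
model_a = [0.80, 0.705, 0.60, 0.52]
model_b = [0.75, 0.70, 0.66, 0.523]

print(rope_counts(model_a, model_b, rope=0.01))
# → {'a_better': 1, 'equivalent': 2, 'b_better': 1}
```

Widening rope moves more datasets into the equivalent bucket, so the value should reflect the smallest difference that matters in practice on the normalized 0–1 scale.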

frequentist_comparison(reference=None, alpha=0.05)#

Compute frequentist model comparisons.

Runs a Friedman omnibus test first, then one of two post-hoc procedures. All-pairs mode follows the Demšar (2006) / autorank workflow and applies the Nemenyi post-hoc test. Reference mode is an evaluma extension: pairwise Wilcoxon signed-rank tests against a named baseline, with Holm correction.

Parameters:

reference – If given, compare every other model against this model only, using Wilcoxon + Holm. None triggers all-pairs Nemenyi mode.
alpha – Significance level for the significant column (default 0.05).

Returns:

Result with .table and .plot().

Return type:

FrequentistResult

Raises:

ValueError – If fewer than 5 datasets are present.

References

Demšar, J. (2006). Statistical Comparisons of Classifiers over Multiple Data Sets. JMLR, 7, 1–30.
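The all-pairs workflow rests on average ranks and the Nemenyi critical difference, CD = q_α · sqrt(k(k+1) / (6N)) for k models and N datasets (Demšar 2006). The sketch below computes both in plain Python. The helper names and the score data are hypothetical, and the code is an illustration of the formula rather than evaluma's own implementation; the q_α values are the α = 0.05 critical values tabulated by Demšar.

```python
import math

# Nemenyi critical values q_alpha for alpha = 0.05, keyed by number of models k.
Q_ALPHA_005 = {2: 1.960, 3: 2.343, 4: 2.569, 5: 2.728}


def average_ranks(matrix):
    """matrix: {model: [score per dataset]}. Rank 1 is the highest score; ties share the mean rank."""
    models = list(matrix)
    n_datasets = len(next(iter(matrix.values())))
    totals = {m: 0.0 for m in models}
    for j in range(n_datasets):
        column = {m: matrix[m][j] for m in models}
        for m in models:
            higher = sum(1 for v in column.values() if v > column[m])
            tied = sum(1 for v in column.values() if v == column[m])  # includes m itself
            totals[m] += higher + (tied + 1) / 2
    return {m: totals[m] / n_datasets for m in models}


def critical_difference(k, n_datasets, q_alpha):
    """Nemenyi critical difference for k models compared on n_datasets datasets."""
    return q_alpha * math.sqrt(k * (k + 1) / (6 * n_datasets))


# Hypothetical scores: 3 models on 5 datasets (higher is better).
scores = {
    "A": [0.9, 0.8, 0.7, 0.9, 0.8],
    "B": [0.8, 0.7, 0.6, 0.8, 0.9],
    "C": [0.5, 0.6, 0.5, 0.6, 0.7],
}
ranks = average_ranks(scores)
cd = critical_difference(len(scores), 5, Q_ALPHA_005[len(scores)])

for a, b in [("A", "B"), ("A", "C"), ("B", "C")]:
    print(a, b, "significant" if abs(ranks[a] - ranks[b]) > cd else "not significant")
```

Two models differ significantly under this criterion only when their average ranks are more than CD apart, which is what the bracket in a CD diagram visualizes.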

performance_profiles()#

Compute Dolan-Moré performance profiles.

Profiles are computed on the raw (un-normalized) score matrix. All raw values must be strictly positive; see Raises.

Returns: Result with .table and .plot().
Return type: ProfileResult
Raises: ValueError – If any raw score is zero or negative.
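A performance profile reports, for each model, the fraction of datasets on which it is within a factor τ of the best model. The sketch below assumes higher scores are better and uses the ratio best / score, which is one way to adapt the Dolan-Moré cost ratio to scores; the ratio evaluma uses is not specified here, and `performance_profile` is a hypothetical helper. The ratio is only meaningful for strictly positive values, which is why zero or negative scores are rejected.

```python
def performance_profile(matrix, taus):
    """matrix: {model: [raw score per dataset]}. Returns {model: [rho(tau) for tau in taus]}."""
    models = list(matrix)
    n = len(next(iter(matrix.values())))
    if any(v <= 0 for scores in matrix.values() for v in scores):
        raise ValueError("all raw scores must be strictly positive")
    best = [max(matrix[m][j] for m in models) for j in range(n)]
    # Performance ratio: 1.0 means the model is the best on that dataset.
    ratios = {m: [best[j] / matrix[m][j] for j in range(n)] for m in models}
    return {m: [sum(r <= t for r in ratios[m]) / n for t in taus] for m in models}


# Hypothetical raw scores: 2 models on 4 datasets.
raw = {
    "A": [10, 20, 30, 40],
    "B": [20, 10, 30, 20],
}
profile = performance_profile(raw, taus=[1.0, 1.5, 2.0])
print(profile)
# → {'A': [0.75, 0.75, 1.0], 'B': [0.5, 0.5, 1.0]}
```

The value at τ = 1 is the share of datasets where the model is best (ties included), and the curve reaches 1.0 once τ covers the model's worst ratio.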

class evaluma.FrequentistResult(table: pandas.DataFrame, avg_ranks: pandas.Series, friedman_statistic: float, friedman_p_value: float, reference=None, alpha=0.05, cd=None)#

Result of frequentist_comparison().

table#

avg_ranks#

friedman_statistic#

friedman_p_value#

reference = None#

alpha = 0.05#

cd = None#

plot(title=None)#

Render the comparison result.

In all-pairs mode renders a CD diagram (Demšar 2006) with the Nemenyi critical difference bracket. In reference mode renders a horizontal bar chart of Holm-corrected p-values.

Parameters: title – Optional figure title.
Returns: The rendered figure.
Return type: matplotlib.figure.Figure

Example

>>> result = bench.frequentist_comparison()
>>> fig = result.plot()

evaluma.load_df(df, *, model='model', dataset='dataset', metric='metric', score='score', seed=None, metric_type_bounds=None, norm_ref_low=None, norm_ref_high=None, metric_direction=None, drop_incomplete=False)#

Load a DataFrame and return a ready-to-use Benchmark object.

Parameters:

df – A pandas DataFrame in long format (one row per model/dataset pair, or per model/dataset/seed combination when seed is given). To load from a CSV file, use evaluma.load_csv() instead.
model – Column name for the model identifier.
dataset – Column name for the dataset identifier.
metric – Column name for the metric identifier.
score – Column name for the score values.
seed – Column name for the random seed. When provided, all seed rows are preserved and a seed column is included in the loaded DataFrame.
metric_type_bounds – Dict mapping metric names to (low, high) bound tuples. high may be a model name string, resolved per-dataset to that model’s score. When provided, norm_ref_low and norm_ref_high must be None. Metric direction is inferred from the built-in registry; unknown metrics raise ValueError. Bounded metrics not listed here (e.g. accuracy, iou, f1) use their natural [0, 1] bounds automatically. Unbounded metrics (rmse, mae, mse) must be listed; omitting them raises ValueError.
norm_ref_low – Lower normalization reference — scalar, model name, or per-dataset dict. If None, the per-dataset minimum is used and a UserWarning is emitted. Cannot be combined with metric_type_bounds.
norm_ref_high – Upper normalization reference, same format as norm_ref_low. If None, the per-dataset maximum is used. Cannot be combined with metric_type_bounds.
metric_direction – Dict mapping dataset names to "min" or "max". When used with metric_type_bounds, these entries take precedence over the registry-inferred direction. Without metric_type_bounds, datasets mapped to "min" are negated before normalization so that higher is always better.
drop_incomplete – If True, silently drop models with missing scores instead of raising.

Returns:

Normalized benchmark ready for analysis.

Return type:

Benchmark

Raises:

TypeError – If df is not a pandas DataFrame.
ValueError – If metric_type_bounds is provided together with norm_ref_low or norm_ref_high.
ValueError – If the data contains more than one metric per (model, dataset) pair, or if the score matrix is incomplete and drop_incomplete is False.
ValueError – If a metric referenced by a dataset is not in the registry and not covered by metric_type_bounds.
ValueError – If a regression metric (rmse, mae, mse) is present but no upper bound is specified in metric_type_bounds.
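The long input format and the effect of the normalization references can be sketched in plain Python. This is an illustration under assumptions, not evaluma's code: `pivot_long` and `normalize` are hypothetical helpers, the formula (x − low) / (high − low) is the usual min-max form, and lower-is-better datasets are flipped with 1 − z so that higher is always better.

```python
def pivot_long(rows):
    """Turn long-format rows into a {model: {dataset: score}} matrix."""
    matrix = {}
    for row in rows:
        matrix.setdefault(row["model"], {})[row["dataset"]] = row["score"]
    return matrix


def normalize(matrix, low, high, direction):
    """Min-max normalize per dataset; low, high and direction are keyed by dataset name."""
    out = {}
    for model, scores in matrix.items():
        out[model] = {}
        for dataset, x in scores.items():
            z = (x - low[dataset]) / (high[dataset] - low[dataset])
            if direction.get(dataset, "max") == "min":
                z = 1.0 - z  # assumed flip for lower-is-better metrics
            out[model][dataset] = z
    return out


# Hypothetical long-format data: one row per model/dataset pair.
rows = [
    {"model": "m1", "dataset": "cls", "metric": "accuracy", "score": 0.8},
    {"model": "m1", "dataset": "reg", "metric": "rmse", "score": 2.0},
    {"model": "m2", "dataset": "cls", "metric": "accuracy", "score": 0.6},
    {"model": "m2", "dataset": "reg", "metric": "rmse", "score": 5.0},
]
normalized = normalize(
    pivot_long(rows),
    low={"cls": 0.0, "reg": 0.0},
    high={"cls": 1.0, "reg": 10.0},   # the unbounded metric needs an explicit upper bound
    direction={"cls": "max", "reg": "min"},
)
print(normalized)
```

After normalization both datasets share a 0–1 scale on which higher is better, so an RMSE of 2.0 against an upper bound of 10.0 counts as a strong score.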

evaluma.load_csv(path, *, model='model', dataset='dataset', metric='metric', score='score', seed=None, metric_type_bounds=None, norm_ref_low=None, norm_ref_high=None, metric_direction=None, drop_incomplete=False)#

Load a benchmark CSV file and return a ready-to-use Benchmark object.

Parameters:

path – Path to the CSV file.
model – Column name for the model identifier.
dataset – Column name for the dataset identifier.
metric – Column name for the metric identifier.
score – Column name for the score values.
seed – Column name for the random seed.
metric_type_bounds – See evaluma.load_df().
norm_ref_low – See evaluma.load_df().
norm_ref_high – See evaluma.load_df().
metric_direction – See evaluma.load_df().
drop_incomplete – See evaluma.load_df().

Returns:

Normalized benchmark ready for analysis.

Return type:

Benchmark

evaluma._resolve_metric_type_bounds(metric_type_bounds, dataset_metric_map, raw_matrix, metric_direction_override)#

Resolve per-dataset normalization bounds and directions from the metric registry.

For each dataset, consults metric_type_bounds first, then falls back to the built-in registry for metrics with natural bounds. Raises if an unbounded metric (rmse, mae, mse) has no entry in metric_type_bounds.

Parameters:

metric_type_bounds – Dict mapping metric names → (low, high) tuples.
dataset_metric_map – Dict mapping dataset names → metric name strings.
raw_matrix – Model × dataset score DataFrame (used to resolve model-name bounds).
metric_direction_override – Optional dict mapping dataset names → "min"/"max"; these entries override registry-inferred directions.

Returns:

(norm_ref_low, norm_ref_high, metric_direction) where the first two are pd.Series keyed by dataset and the last is a dict (or None if empty).

Return type:

tuple
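The lookup order described above (explicit metric_type_bounds first, then natural bounds from the registry, with an error for unbounded metrics) can be sketched as follows. The `REGISTRY` contents and the `resolve_bounds` function are hypothetical stand-ins for illustration. The sketch returns plain dicts rather than pd.Series and omits the case where a bound is given as a model name.

```python
# Hypothetical registry: metric name -> (natural bounds or None, direction).
REGISTRY = {
    "accuracy": ((0.0, 1.0), "max"),
    "f1": ((0.0, 1.0), "max"),
    "rmse": (None, "min"),
    "mae": (None, "min"),
}


def resolve_bounds(metric_type_bounds, dataset_metric_map, direction_override=None):
    """Return per-dataset (low, high, direction) dicts."""
    metric_type_bounds = metric_type_bounds or {}
    direction_override = direction_override or {}
    low, high, direction = {}, {}, {}
    for dataset, metric in dataset_metric_map.items():
        if metric not in REGISTRY:
            raise ValueError(f"unknown metric: {metric!r}")
        natural, registry_direction = REGISTRY[metric]
        # Explicit bounds win; otherwise fall back to the metric's natural bounds.
        bounds = metric_type_bounds.get(metric, natural)
        if bounds is None:
            raise ValueError(f"unbounded metric {metric!r} needs an entry in metric_type_bounds")
        low[dataset], high[dataset] = bounds
        # Per-dataset overrides take precedence over the registry direction.
        direction[dataset] = direction_override.get(dataset, registry_direction)
    return low, high, direction


low, high, direction = resolve_bounds(
    {"rmse": (0.0, 10.0)},
    {"cls": "accuracy", "reg": "rmse"},
)
print(low, high, direction)
# → {'cls': 0.0, 'reg': 0.0} {'cls': 1.0, 'reg': 10.0} {'cls': 'max', 'reg': 'min'}
```

Calling `resolve_bounds({}, {"reg": "rmse"})` raises ValueError in this sketch, mirroring the documented behaviour for unbounded metrics without an explicit entry.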