evaluma#
Submodules#
Attributes#
Classes#
| Benchmark | Container for a normalized model-vs-dataset score matrix. |
| FrequentistResult | Result of frequentist_comparison(). |
Functions#
| load_df | Load a DataFrame and return a ready-to-use Benchmark object. |
| load_csv | Load a benchmark CSV file and return a ready-to-use Benchmark object. |
| _resolve_metric_type_bounds | Resolve per-dataset normalization bounds and directions from the metric registry. |
Package Contents#
- evaluma.__version__: str#
- class evaluma.Benchmark(raw_matrix: pandas.DataFrame, *, norm_ref_low=None, norm_ref_high=None, metric_direction=None, raw_runs=None)#
Container for a normalized model-vs-dataset score matrix.
After construction the normalized scores are available as scores_. Use the analysis methods to compute rankings, comparisons, and profiles.
- _raw#
- _norm_ref_low = None#
- _norm_ref_high = None#
- _metric_direction = None#
- _raw_runs = None#
- _normalize(matrix)#
- property scores_#
Normalized model × dataset score matrix.
- _new(raw_matrix, raw_runs=None)#
- select_models(models)#
Subset the benchmark to the given models.
- Parameters:
models – List of model names to retain.
- Returns:
New benchmark containing only the selected models.
- Return type:
Benchmark
- select_datasets(datasets)#
Subset the benchmark to the given datasets.
- Parameters:
datasets – List of dataset names to retain.
- Returns:
New benchmark containing only the selected datasets.
- Return type:
Benchmark
- drop_incomplete()#
Remove models that have missing scores for any dataset.
- Returns:
New benchmark with incomplete models removed.
- Return type:
Benchmark
- iqm_ranking(n_bootstrap=1000, random_state=None)#
Compute IQM rankings with stratified bootstrap confidence intervals.
Implements the interquartile mean (IQM) of Agarwal et al. (2021, rliable) on the flat run × dataset score array. Requires multiple seeds; use aggregate_ranking() for single-run data.
- Parameters:
n_bootstrap – Number of bootstrap samples for the 95% CI.
random_state – Seed for the random number generator.
- Returns:
Result with .table and .plot().
- Return type:
- Raises:
ValueError – If no seed data is available (_raw_runs is None).
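To make the IQM and its stratified bootstrap concrete, here is a minimal standard-library sketch. It is not evaluma's implementation: the helper names `iqm` and `bootstrap_iqm_ci`, the `runs_by_dataset` layout, and the percentile interval are assumptions chosen for illustration. The trimming follows the usual 25% trimmed-mean convention.

```python
import random
import statistics


def iqm(values):
    """Interquartile mean: mean of what remains after dropping the lowest and highest 25%."""
    ordered = sorted(values)
    cut = int(0.25 * len(ordered))  # number of values trimmed from each end
    return statistics.fmean(ordered[cut:len(ordered) - cut])


def bootstrap_iqm_ci(runs_by_dataset, n_bootstrap=1000, random_state=None):
    """Percentile 95% CI for the IQM, resampling runs within each dataset (stratified)."""
    rng = random.Random(random_state)
    estimates = []
    for _ in range(n_bootstrap):
        flat = []
        for runs in runs_by_dataset.values():
            # Resample with replacement inside the dataset, never across datasets.
            flat.extend(rng.choices(runs, k=len(runs)))
        estimates.append(iqm(flat))
    estimates.sort()
    low = estimates[int(0.025 * (n_bootstrap - 1))]
    high = estimates[int(0.975 * (n_bootstrap - 1))]
    return low, high


# Hypothetical normalized scores for one model: three seeds on each of three datasets.
runs = {"d1": [0.61, 0.64, 0.58], "d2": [0.82, 0.79, 0.85], "d3": [0.40, 0.47, 0.43]}
point = iqm([score for seeds in runs.values() for score in seeds])
low, high = bootstrap_iqm_ci(runs, n_bootstrap=500, random_state=0)
```

Resampling within each dataset keeps every dataset represented in each bootstrap replicate, which is why seed-level data is needed.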
- aggregate_ranking(agg='trimmed_mean')#
Compute a point-estimate descriptive ranking (no CI).
Works on any benchmark regardless of whether seed data is present.
Note
This is a descriptive point estimate only (no CI). The trimmed-mean variant trims across datasets, not across seeds; with fewer than ~10 datasets the 25% trim is aggressive (e.g. 5 datasets → only 3 contribute). Treat results as exploratory. For a statistically grounded ranking with uncertainty, use iqm_ranking() (requires multiple seeds).
- Parameters:
agg – Aggregation mode: "trimmed_mean" (default), "mean", or "median".
- Returns:
Result with .table and .plot().
- Return type:
- Raises:
ValueError – If agg is not a supported mode.
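The three aggregation modes can be sketched in a few lines. This is an illustrative stand-in rather than the library's code: the `aggregate` helper and the example matrix are hypothetical, and the trim uses the common rule of dropping `int(0.25 * n)` values from each end, which matches the "5 datasets, 3 contribute" case in the note above.

```python
import statistics


def aggregate(scores, agg="trimmed_mean"):
    """Collapse one model's per-dataset scores into a single number."""
    ordered = sorted(scores)
    if agg == "mean":
        return statistics.fmean(ordered)
    if agg == "median":
        return statistics.median(ordered)
    if agg == "trimmed_mean":
        cut = int(0.25 * len(ordered))  # with 5 datasets this drops 1 from each end
        return statistics.fmean(ordered[cut:len(ordered) - cut])
    raise ValueError(f"unsupported agg: {agg!r}")


# Hypothetical normalized scores: one value per dataset, five datasets.
matrix = {
    "model_a": [0.90, 0.20, 0.60, 0.70, 0.65],
    "model_b": [0.55, 0.60, 0.58, 0.62, 0.50],
}

# Higher aggregate is better, so sort in descending order.
ranking = sorted(matrix, key=lambda name: aggregate(matrix[name]), reverse=True)
```

With so few datasets a single outlier can move the mean a lot while the trimmed mean ignores it, which is one reason the results are best treated as exploratory.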
- bayesian_comparison(rope=0.01, reference=None, pairs=None, random_state=None)#
Compute pairwise Bayesian comparisons via signed-rank test.
- Parameters:
rope – Half-width of the region of practical equivalence (ROPE) in normalized score space (0–1). Differences smaller than rope are treated as practically equivalent.
reference – If given, only compare each other model against this one.
pairs – Explicit list of (model_a, model_b) pairs to test. Overrides reference.
random_state – Seed for baycomp’s sampler.
- Returns:
Result with .table and .plot().
- Return type:
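The rope parameter is easiest to read as a threshold on paired score differences. The sketch below only shows that partition; it is not the Bayesian signed-rank test itself, which returns posterior probabilities for the three regions. The function name `rope_partition` and the example scores are hypothetical.

```python
def rope_partition(scores_a, scores_b, rope=0.01):
    """Count datasets where a is better, practically equivalent, or worse than b."""
    counts = {"a_better": 0, "equivalent": 0, "b_better": 0}
    for a, b in zip(scores_a, scores_b):
        diff = a - b
        if abs(diff) < rope:
            counts["equivalent"] += 1  # difference falls inside the ROPE
        elif diff > 0:
            counts["a_better"] += 1
        else:
            counts["b_better"] += 1
    return counts


# Hypothetical normalized scores of two models on four datasets.
model_a = [0.80, 0.705, 0.60, 0.52]
model_b = [0.70, 0.700, 0.66, 0.52]
counts = rope_partition(model_a, model_b, rope=0.01)
```

A wider rope moves more datasets into the equivalent bucket, so it is worth choosing the value in terms of what counts as a meaningful difference on the normalized 0–1 scale.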
- frequentist_comparison(reference=None, alpha=0.05)#
Compute frequentist model comparisons.
Runs a Friedman omnibus test first. All-pairs mode then follows the Demšar (2006) / autorank workflow with a Nemenyi post-hoc test. Reference mode is an evaluma extension: pairwise Wilcoxon signed-rank tests against a named baseline with Holm correction.
- Parameters:
reference – If given, only compare each other model against this one using Wilcoxon + Holm. None triggers all-pairs Nemenyi mode.
alpha – Significance level for the significant column (default 0.05).
- Returns:
Result with .table and .plot().
- Return type:
FrequentistResult
- Raises:
ValueError – If fewer than 5 datasets are present.
References
Demšar, J. (2006). Statistical Comparisons of Classifiers over Multiple Data Sets. JMLR, 7, 1–30.
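For reference mode, the Holm step-down correction applied to the pairwise p-values can be sketched as follows. This is a generic implementation of the procedure, not evaluma's code; the `holm_adjust` helper and the raw p-values are made up for illustration.

```python
def holm_adjust(p_values):
    """Holm step-down adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by ascending p-value
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The smallest p-value is multiplied by m, the next by m - 1, and so on.
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Enforce monotonicity: an adjusted p-value never drops below an earlier one.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


# Hypothetical raw Wilcoxon p-values of three models against one reference.
raw = {"model_b": 0.01, "model_c": 0.04, "model_d": 0.03}
adjusted = dict(zip(raw, holm_adjust(list(raw.values()))))
significant = {name: p < 0.05 for name, p in adjusted.items()}
```

Here only model_b stays below 0.05 after correction, even though all three raw p-values were below it.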
- performance_profiles()#
Compute Dolan-Moré performance profiles.
Profiles are computed on the raw (un-normalized) score matrix. All raw values must be strictly positive; see Raises.
- Returns:
Result with .table and .plot().
- Return type:
- Raises:
ValueError – If any raw score is zero or negative.
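The classical Dolan-Moré construction divides each value by the best value on the same problem, which is why zero or negative entries cannot be handled. The sketch below follows the original lower-is-better (cost) formulation; how evaluma orients higher-is-better scores is not stated in this reference, so treat the orientation, the `performance_profile` helper, and the data as assumptions.

```python
def performance_profile(costs, taus):
    """Fraction of datasets on which each model is within a factor tau of the best."""
    models = list(costs)
    n_datasets = len(next(iter(costs.values())))
    if any(value <= 0 for values in costs.values() for value in values):
        raise ValueError("all raw values must be strictly positive")
    # Best (smallest) cost per dataset across all models.
    best = [min(costs[m][j] for m in models) for j in range(n_datasets)]
    profile = {}
    for m in models:
        ratios = [costs[m][j] / best[j] for j in range(n_datasets)]
        profile[m] = [sum(r <= tau for r in ratios) / n_datasets for tau in taus]
    return profile


# Hypothetical raw costs of two models on four datasets.
costs = {"model_a": [1.0, 2.0, 4.0, 3.0], "model_b": [2.0, 2.0, 1.0, 6.0]}
profile = performance_profile(costs, taus=[1.0, 2.0, 4.0])
```

The value at tau = 1 is the share of datasets where a model is (tied for) best, and the curve reaching 1.0 early indicates a model that is rarely far from the best.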
- class evaluma.FrequentistResult(table: pandas.DataFrame, avg_ranks: pandas.Series, friedman_statistic: float, friedman_p_value: float, reference=None, alpha=0.05, cd=None)#
Result of frequentist_comparison().
- table#
- avg_ranks#
- friedman_statistic#
- friedman_p_value#
- reference = None#
- alpha = 0.05#
- cd = None#
- plot(title=None)#
Render the comparison result.
In all-pairs mode renders a CD diagram (Demšar 2006) with the Nemenyi critical difference bracket. In reference mode renders a horizontal bar chart of Holm-corrected p-values.
- Parameters:
title – Optional figure title.
- Returns:
The rendered figure.
- Return type:
matplotlib.figure.Figure
Example
>>> result = bench.frequentist_comparison()
>>> fig = result.plot()
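The critical difference drawn in the CD diagram comes from Demšar's formula CD = q_alpha * sqrt(k(k+1) / (6N)) for k models and N datasets. The sketch below computes average ranks and the CD with the standard library only. The helper names and scores are hypothetical, and the q values are the alpha = 0.05 Nemenyi critical values tabulated in Demšar (2006); it is not a reproduction of how evaluma or autorank computes them.

```python
import math

# Nemenyi critical values q_alpha for alpha = 0.05, keyed by number of models k (Demšar 2006).
Q_ALPHA_005 = {2: 1.960, 3: 2.343, 4: 2.569, 5: 2.728, 6: 2.850,
               7: 2.949, 8: 3.031, 9: 3.102, 10: 3.164}


def average_ranks(scores):
    """scores: {model: [score per dataset]}; returns mean rank per model (1 = best)."""
    models = list(scores)
    n_datasets = len(next(iter(scores.values())))
    totals = dict.fromkeys(models, 0.0)
    for j in range(n_datasets):
        column = [scores[m][j] for m in models]
        for m in models:
            better = sum(v > scores[m][j] for v in column)
            tied = sum(v == scores[m][j] for v in column)  # includes the model itself
            totals[m] += better + (tied + 1) / 2  # average rank within a tie group
    return {m: totals[m] / n_datasets for m in models}


def critical_difference(k, n_datasets):
    """Nemenyi critical difference at alpha = 0.05."""
    return Q_ALPHA_005[k] * math.sqrt(k * (k + 1) / (6 * n_datasets))


# Hypothetical normalized scores of three models on five datasets.
scores = {
    "model_a": [0.9, 0.8, 0.7, 0.9, 0.6],
    "model_b": [0.8, 0.7, 0.9, 0.7, 0.5],
    "model_c": [0.5, 0.6, 0.6, 0.6, 0.4],
}
ranks = average_ranks(scores)
cd = critical_difference(k=len(scores), n_datasets=5)
a_vs_c_differs = abs(ranks["model_a"] - ranks["model_c"]) > cd
```

Two models whose average ranks differ by less than the CD are joined by a bar in the diagram, meaning the test cannot separate them.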
- evaluma.load_df(df, *, model='model', dataset='dataset', metric='metric', score='score', seed=None, metric_type_bounds=None, norm_ref_low=None, norm_ref_high=None, metric_direction=None, drop_incomplete=False)#
Load a DataFrame and return a ready-to-use Benchmark object.
- Parameters:
df – A pandas DataFrame in long format (one row per model/dataset pair). To load from a CSV file, use evaluma.load_csv() instead.
model – Column name for the model identifier.
dataset – Column name for the dataset identifier.
metric – Column name for the metric identifier.
score – Column name for the score values.
seed – Column name for the random seed. When provided, all seed rows are preserved and a seed column is included in the loaded DataFrame.
metric_type_bounds – Dict mapping metric names to (low, high) bound tuples. high may be a model name string, resolved per-dataset to that model’s score. When provided, norm_ref_low and norm_ref_high must be None. Metric direction is inferred from the built-in registry; unknown metrics raise ValueError. Bounded metrics not listed here (e.g. accuracy, iou, f1) use their natural [0, 1] bounds automatically. Unbounded metrics (rmse, mae, mse) must be listed; omitting them raises ValueError.
norm_ref_low – Lower normalization reference: a scalar, a model name, or a per-dataset dict. If None, the per-dataset minimum is used and a UserWarning is emitted. Cannot be combined with metric_type_bounds.
norm_ref_high – Upper normalization reference, same format as norm_ref_low. If None, the per-dataset maximum is used. Cannot be combined with metric_type_bounds.
metric_direction – Dict mapping dataset names to "min" or "max". When used with metric_type_bounds, these entries take precedence over the registry-inferred direction. Without metric_type_bounds, datasets mapped to "min" are negated before normalization so that higher is always better.
drop_incomplete – If True, silently drop models with missing scores instead of raising.
- Returns:
Normalized benchmark ready for analysis.
- Return type:
Benchmark
- Raises:
TypeError – If df is not a pandas DataFrame.
ValueError – If metric_type_bounds is provided together with norm_ref_low or norm_ref_high.
ValueError – If the data contains more than one metric per (model, dataset) pair, or if the score matrix is incomplete and drop_incomplete is False.
ValueError – If a metric referenced by a dataset is not in the registry and not covered by metric_type_bounds.
ValueError – If a regression metric (rmse, mae, mse) is present but no upper bound is specified in metric_type_bounds.
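The normalization described for norm_ref_low, norm_ref_high, and metric_direction amounts to a per-dataset min-max rescale after orienting the scores so that higher is better. The following is a rough sketch of that idea for a single dataset column, not the library's code. The `normalize_column` helper is hypothetical, and interpreting explicit references in the already-oriented (post-negation) space is an assumption.

```python
def normalize_column(scores, low=None, high=None, direction="max"):
    """Min-max normalize one dataset's scores; scores maps model name -> raw score."""
    values = dict(scores)
    if direction == "min":
        # Negate lower-is-better metrics so that higher is always better.
        values = {model: -value for model, value in values.items()}
    # Default references: the per-dataset minimum and maximum.
    low = min(values.values()) if low is None else low
    high = max(values.values()) if high is None else high
    if high == low:
        raise ValueError("normalization references must differ")
    return {model: (value - low) / (high - low) for model, value in values.items()}


# Hypothetical raw scores on two datasets.
rmse = {"model_a": 2.0, "model_b": 4.0, "model_c": 3.0}
accuracy = {"model_a": 0.90, "model_b": 0.60, "model_c": 0.75}

norm_rmse = normalize_column(rmse, direction="min")         # data-driven references
norm_acc = normalize_column(accuracy, low=0.0, high=1.0)    # fixed natural bounds
```

With data-driven references the best model always scores 1.0 and the worst 0.0, so the result depends on which models are in the benchmark; fixed references avoid that dependence.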
- evaluma.load_csv(path, *, model='model', dataset='dataset', metric='metric', score='score', seed=None, metric_type_bounds=None, norm_ref_low=None, norm_ref_high=None, metric_direction=None, drop_incomplete=False)#
Load a benchmark CSV file and return a ready-to-use Benchmark object.
- Parameters:
path – Path to the CSV file.
model – Column name for the model identifier.
dataset – Column name for the dataset identifier.
metric – Column name for the metric identifier.
score – Column name for the score values.
seed – Column name for the random seed.
metric_type_bounds – See evaluma.load_df().
norm_ref_low – See evaluma.load_df().
norm_ref_high – See evaluma.load_df().
metric_direction – See evaluma.load_df().
drop_incomplete – See evaluma.load_df().
- Returns:
Normalized benchmark ready for analysis.
- Return type:
Benchmark
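As an illustration of the long-format layout the loaders expect with the default column names, the sketch below parses a small CSV and pivots it into a model × dataset mapping using only the standard library. It does not call evaluma. The helpers `pivot_long` and `incomplete_models` and the sample rows are hypothetical, and seed columns are left out for brevity.

```python
import csv
import io

# One row per model/dataset pair, using the default column names.
CSV_TEXT = """model,dataset,metric,score
model_a,d1,accuracy,0.91
model_a,d2,rmse,2.0
model_b,d1,accuracy,0.85
"""


def pivot_long(rows, model="model", dataset="dataset", score="score"):
    """Turn long-format rows into {model: {dataset: score}}."""
    matrix = {}
    for row in rows:
        cells = matrix.setdefault(row[model], {})
        if row[dataset] in cells:
            raise ValueError(f"duplicate entry for {row[model]!r} on {row[dataset]!r}")
        cells[row[dataset]] = float(row[score])
    return matrix


def incomplete_models(matrix):
    """Models that lack a score for at least one dataset seen in the data."""
    datasets = {name for cells in matrix.values() for name in cells}
    return sorted(m for m, cells in matrix.items() if set(cells) != datasets)


rows = list(csv.DictReader(io.StringIO(CSV_TEXT)))
matrix = pivot_long(rows)
missing = incomplete_models(matrix)  # model_b has no score for d2
```

A model like model_b here is what the drop_incomplete flag is about: by default such a gap is an error, and with drop_incomplete=True the model is dropped.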
- evaluma._resolve_metric_type_bounds(metric_type_bounds, dataset_metric_map, raw_matrix, metric_direction_override)#
Resolve per-dataset normalization bounds and directions from the metric registry.
For each dataset, consults metric_type_bounds first, then falls back to the built-in registry for metrics with natural bounds. Raises if an unbounded metric (rmse, mae, mse) has no entry in metric_type_bounds.
- Parameters:
metric_type_bounds – Dict mapping metric names → (low, high) tuples.
dataset_metric_map – Dict mapping dataset names → metric name strings.
raw_matrix – Model × dataset score DataFrame (used to resolve model-name bounds).
metric_direction_override – Optional dict mapping dataset names → "min" / "max"; these entries override registry-inferred directions.
- Returns:
(norm_ref_low, norm_ref_high, metric_direction) where the first two are pd.Series keyed by dataset and the last is a dict (or None if empty).
- Return type:
tuple
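The resolution order described above can be sketched with plain dictionaries. This is a simplified illustration and not the private function itself: `REGISTRY` is a hypothetical stand-in for the built-in metric registry, `resolve_bounds` returns dicts rather than pd.Series, the raw matrix is a nested dict, and metrics missing from the stand-in registry always raise.

```python
# Hypothetical stand-in for the built-in registry; the real contents are not shown in this reference.
REGISTRY = {
    "accuracy": {"direction": "max", "bounds": (0.0, 1.0)},
    "f1": {"direction": "max", "bounds": (0.0, 1.0)},
    "rmse": {"direction": "min", "bounds": None},  # unbounded: needs explicit bounds
}


def resolve_bounds(metric_type_bounds, dataset_metric_map, raw_matrix, direction_override=None):
    """Return per-dataset (low, high, direction) dicts."""
    direction_override = direction_override or {}
    low, high, direction = {}, {}, {}
    for dataset, metric in dataset_metric_map.items():
        if metric not in REGISTRY:
            raise ValueError(f"unknown metric: {metric!r}")
        # User-supplied bounds win; otherwise fall back to the metric's natural bounds.
        bounds = metric_type_bounds.get(metric, REGISTRY[metric]["bounds"])
        if bounds is None:
            raise ValueError(f"unbounded metric {metric!r} needs an entry in metric_type_bounds")
        lo, hi = bounds
        if isinstance(hi, str):
            # A model name as upper bound resolves to that model's score on this dataset.
            hi = raw_matrix[hi][dataset]
        low[dataset], high[dataset] = lo, hi
        direction[dataset] = direction_override.get(dataset, REGISTRY[metric]["direction"])
    return low, high, direction


# Hypothetical inputs: raw_matrix is {model: {dataset: score}}.
raw_matrix = {"baseline": {"d1": 0.5, "d2": 4.0}, "model_a": {"d1": 0.9, "d2": 2.0}}
dataset_metric_map = {"d1": "accuracy", "d2": "rmse"}
low, high, direction = resolve_bounds({"rmse": (0.0, "baseline")}, dataset_metric_map, raw_matrix)
```

Resolving a model name per dataset lets a baseline model act as the upper reference for metrics that have no natural maximum.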