Overview

Given a collection of models evaluated on multiple datasets, the question “which model is best?” is harder than it sounds. Simple averages obscure robustness, ignore statistical uncertainty, and make it difficult to compare methods across papers. Several published works and competition benchmarks have proposed schemes for producing more informative rankings. evaluma is designed as an accessible entry point for building such rankings from your own benchmark evaluations. It provides four complementary analyses from a single input format:

Aggregate ranking — point-estimate ranking via trimmed mean, mean, or median
IQM ranking — robust central tendency with bootstrap confidence intervals (requires multiple seeds per model/dataset)
Bayesian comparison — probability that one model outperforms another
Performance profiles — cumulative distribution showing how often each model comes within a factor of the best

This page walks through installing the package, preparing your results CSV, and running the analyses.

Installation

pip install evaluma

For a development install from source:

git clone https://github.com/nilsleh/evaluma
cd evaluma
pip install -e ".[dev]"

Input format

evaluma expects a long-format CSV with one row per (model, dataset) combination:

model	dataset	metric	score
ResNet	CIFAR10	acc	0.91
ResNet	ImageNet	acc	0.87
ViT	CIFAR10	acc	0.94
ViT	ImageNet	acc	0.89
ConvNeXt	CIFAR10	acc	0.93
ConvNeXt	ImageNet	acc	0.85

The four required columns are:

Column	Description
`model`	Model or method name
`dataset`	Dataset or task name
`metric`	Metric name (e.g., `acc`, `iou`, `rmse`)
`score`	The numeric score value

Multiple random seeds per (model, dataset) pair are supported: add a seed column to the CSV and pass its name to load_csv() via the seed argument (for example, seed="seed_column").

If your CSV uses different column names, pass them as arguments to load_csv() — see the quickstart below.
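Results are often stored in wide format, with one column per dataset, rather than the long format described above. The following is a minimal sketch of the reshaping step using only pandas; the `wide` table is hypothetical example data that reuses the scores from the table above, and nothing here calls evaluma itself.

```python
import pandas as pd

# Hypothetical wide-format results: one row per model, one column per dataset.
wide = pd.DataFrame(
    {
        "model": ["ResNet", "ViT", "ConvNeXt"],
        "CIFAR10": [0.91, 0.94, 0.93],
        "ImageNet": [0.87, 0.89, 0.85],
    }
)

# Reshape to long format: one row per (model, dataset) pair.
long_df = wide.melt(id_vars="model", var_name="dataset", value_name="score")

# Every score in this example is an accuracy, so the metric column is constant.
long_df["metric"] = "acc"

# Match the column order of the input format and sort for readability.
long_df = long_df[["model", "dataset", "metric", "score"]].sort_values(
    ["model", "dataset"], ignore_index=True
)

# Serialise to CSV text; pass a path such as "results.csv" to write a file instead.
csv_text = long_df.to_csv(index=False)
print(csv_text)
```

If you have several runs per model, add a seed column to the wide table and include it in `id_vars` so that it is carried through the reshape.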

Quickstart

Load your CSV and construct a Benchmark object. The column arguments tell evaluma which CSV columns map to model, dataset, metric, and score:

import evaluma

bench = evaluma.load_csv(
    "results.csv",
    model="model",
    dataset="dataset",
    metric="metric",
    score="score",
)

The Benchmark object holds your data and provides four analysis methods. Each returns a result object with a .table attribute (a pandas DataFrame) and a .plot() method.

Aggregate ranking

Compute a point-estimate ranking across datasets:

agg = bench.aggregate_ranking()
print(agg.table)
agg.plot()

The default aggregation is the trimmed mean (discards the top and bottom 25% of per-dataset scores). Pass agg="mean" or agg="median" for alternatives. This method works with any benchmark, including single-run data.
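To make the effect of trimming concrete, here is a small NumPy sketch of a 25% trimmed mean compared with a plain mean. It illustrates the general idea only and is not evaluma's implementation: the `trimmed_mean` helper, the model names, and the scores are all hypothetical, and the rounding rule for the number of trimmed values (floor) is an assumption.

```python
import numpy as np


def trimmed_mean(values, proportion=0.25):
    """Mean after dropping `proportion` of the values from each tail."""
    x = np.sort(np.asarray(values, dtype=float))
    k = int(proportion * x.size)  # number of values cut from each end (floor)
    return float(x[k : x.size - k].mean())


# Hypothetical per-dataset accuracies for two models on eight datasets.
scores = {
    "model_a": [0.80, 0.82, 0.81, 0.83, 0.79, 0.84, 0.82, 0.10],  # one failed run
    "model_b": [0.78, 0.79, 0.80, 0.77, 0.78, 0.79, 0.80, 0.78],
}

# Rank models from best to worst under each aggregation.
trimmed_ranking = sorted(scores, key=lambda m: trimmed_mean(scores[m]), reverse=True)
mean_ranking = sorted(scores, key=lambda m: np.mean(scores[m]), reverse=True)

print("trimmed mean:", trimmed_ranking)
print("plain mean:  ", mean_ranking)
```

In this example the single failed dataset drags model_a below model_b under the plain mean, while the trimmed mean discards it and ranks model_a first.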

IQM ranking

Compute the interquartile mean with 95% bootstrap confidence intervals:

iqm = bench.iqm_ranking()
print(iqm.table)
iqm.plot()

The IQM discards the top and bottom 25% of per-dataset scores before averaging, making it resistant to outliers. The bootstrap CIs are stratified: seeds are resampled independently within each dataset. This analysis requires multiple seeds per (model, dataset) pair, so pass seed="seed_column" to load_csv() when loading; for single-run data, use aggregate_ranking() instead. See the IQM Ranking tutorial for a more in-depth example.
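The stratified bootstrap described above can be sketched in a few lines of NumPy. This is an illustration under stated assumptions rather than evaluma's actual code: the helper names `iqm` and `stratified_bootstrap_ci`, the percentile-interval construction, the number of resamples, and the synthetic scores are all hypothetical.

```python
import numpy as np


def iqm(values):
    """Interquartile mean: average of the middle 50% of the sorted values."""
    x = np.sort(np.asarray(values, dtype=float).ravel())
    k = int(0.25 * x.size)  # values dropped from each tail
    return float(x[k : x.size - k].mean())


def stratified_bootstrap_ci(scores, n_boot=2000, level=0.95, seed=0):
    """Percentile CI for the IQM; `scores` has shape (n_datasets, n_seeds)."""
    rng = np.random.default_rng(seed)
    n_datasets, n_seeds = scores.shape
    rows = np.arange(n_datasets)[:, None]
    stats = np.empty(n_boot)
    for b in range(n_boot):
        # Draw seed indices with replacement, independently for every dataset.
        idx = rng.integers(0, n_seeds, size=(n_datasets, n_seeds))
        stats[b] = iqm(scores[rows, idx])
    alpha = (1.0 - level) / 2.0
    low, high = np.quantile(stats, [alpha, 1.0 - alpha])
    return float(low), float(high)


# Synthetic scores for one model: 4 datasets x 5 seeds.
scores = 0.8 + 0.02 * np.random.default_rng(42).standard_normal((4, 5))

point = iqm(scores)
low, high = stratified_bootstrap_ci(scores)
print(f"IQM = {point:.3f}, 95% CI = [{low:.3f}, {high:.3f}]")
```

Resampling within each dataset keeps every dataset represented in every bootstrap replicate, so the interval reflects seed-to-seed variability rather than which datasets happened to be drawn.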

Bayesian pairwise comparison

Compute posterior probabilities for every model pair:

bayes = bench.bayesian_comparison()
print(bayes.table)
bayes.plot()

For each pair (A, B), the output gives p_a_better, p_equiv, and p_b_better: the probabilities that A is better than B, that the two are equivalent (within a region of practical equivalence), or that B is better than A. See the Bayesian Comparison tutorial for a more in-depth example.
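This page does not say which Bayesian model evaluma uses, so the sketch below substitutes one common choice, a Bayesian sign test with a Dirichlet posterior, purely to show how three probabilities of this kind can arise from paired scores. The function name, the equivalence-region width `rope`, the prior, and the scores are all assumptions made for illustration.

```python
import numpy as np


def bayesian_sign_test(scores_a, scores_b, rope=0.01, prior_strength=1.0,
                       n_samples=50_000, seed=0):
    """Return (p_a_better, p_equiv, p_b_better) from paired per-dataset scores."""
    diff = np.asarray(scores_a, dtype=float) - np.asarray(scores_b, dtype=float)
    counts = np.array(
        [
            np.sum(diff > rope),           # A better by more than the region width
            np.sum(np.abs(diff) <= rope),  # practically equivalent
            np.sum(diff < -rope),          # B better by more than the region width
        ],
        dtype=float,
    )
    # Dirichlet posterior over the three outcome probabilities. The prior
    # pseudo-count sits on the equivalence region (an assumption), and a tiny
    # constant keeps every concentration parameter strictly positive.
    alpha = counts + 1e-4
    alpha[1] += prior_strength
    rng = np.random.default_rng(seed)
    theta = rng.dirichlet(alpha, size=n_samples)
    # Fraction of posterior draws in which each outcome is the most probable.
    winners = theta.argmax(axis=1)
    probs = np.bincount(winners, minlength=3) / n_samples
    return tuple(float(p) for p in probs)


# Hypothetical accuracies on ten datasets; A leads B by 0.03 everywhere.
scores_a = [0.91, 0.88, 0.93, 0.85, 0.90, 0.87, 0.92, 0.89, 0.86, 0.94]
scores_b = [s - 0.03 for s in scores_a]

p_a_better, p_equiv, p_b_better = bayesian_sign_test(scores_a, scores_b)
print(p_a_better, p_equiv, p_b_better)
```

Widening `rope` moves probability mass toward p_equiv, which is the practical reason to set the equivalence region from what counts as a meaningful difference for your metric.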

Performance profiles

Plot how often each model achieves near-best performance across datasets:

profiles = bench.performance_profiles()
print(profiles.table)
profiles.plot()

The profile curve shows, for each performance ratio τ ≥ 1, the fraction of datasets on which a model’s score is within a factor of τ of the best score. A curve that rises faster means the model is close to the best more often. See the Performance Profiles tutorial for a more in-depth example.
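As a worked illustration of the profile definition, the sketch below computes the curve at three values of τ for the scores from the input-format example. It assumes a higher-is-better metric and defines the ratio as the best score divided by the model's score; how evaluma defines the ratio for such metrics is not stated on this page, so treat that choice, and the `profile` helper, as hypothetical.

```python
import pandas as pd

# Scores from the input-format example (accuracy, higher is better).
scores = pd.DataFrame(
    {"CIFAR10": [0.91, 0.94, 0.93], "ImageNet": [0.87, 0.89, 0.85]},
    index=["ResNet", "ViT", "ConvNeXt"],
)

# Performance ratio per (model, dataset): best score on the dataset divided by
# the model's score, so the best model has ratio 1 and all others exceed 1.
ratios = scores.rdiv(scores.max(axis=0), axis="columns")


def profile(ratios, taus):
    """Fraction of datasets on which each model's ratio is at most tau."""
    return pd.DataFrame(
        {tau: (ratios <= tau).mean(axis=1) for tau in taus}
    ).rename_axis(columns="tau")


table = profile(ratios, taus=[1.0, 1.03, 1.05])
print(table)
```

Here ViT is best on both datasets, so its curve already equals 1 at τ = 1, while ResNet and ConvNeXt each reach one dataset by τ = 1.03 and both datasets by τ = 1.05.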

Column mapping

If your CSV uses different column names, pass them explicitly:

bench = evaluma.load_csv(
    "results.csv",
    model="experiment",
    dataset="task",
    metric="measure",
    score="value",
)

CLI quickstart

Run all three point-estimate analyses and write CSV and PNG outputs to the current directory:

evaluma report results.csv

With custom column names:

evaluma report results.csv --model experiment --dataset task --score value

Save outputs to a specific directory:

evaluma report results.csv --output ./results/

Individual subcommands are also available: evaluma rank (IQM with seeds), evaluma aggregate, evaluma compare, and evaluma profiles. The CLI supports the same column mapping with --model, --dataset, --metric, and --score flags, or via a YAML config file passed with --config.

Next steps

IQM Ranking tutorial — deeper dive with worked examples
Bayesian Comparison tutorial — interpreting posterior probabilities
Performance Profiles tutorial — understanding the Dolan-Moré framework