evaluma.methods.elo

Attributes

logger

Functions

`compute_battles`(→ pandas.DataFrame)	Generate pairwise battles from a normalized score matrix.
`compute_winrate_matrix`(→ pandas.DataFrame)	Compute M×M empirical win-rate matrix with equal dataset weighting.
`_fit_elo`(→ pandas.Series)	Fit MLE ELO ratings from a battles DataFrame.
`_iterative_elo`(→ numpy.ndarray)	Iterative ELO update rule fallback.
`compute_battles_from_runs`(→ pandas.DataFrame)	Generate battles from per-seed long-format data.
`compute_elo`(→ tuple[pandas.DataFrame, pandas.DataFrame])	Compute MLE ELO ratings with battle-within-task bootstrap CIs.

Module Contents

evaluma.methods.elo.logger

evaluma.methods.elo.compute_battles(scores: pandas.DataFrame, tie_threshold: float | None = None) → pandas.DataFrame

Generate pairwise battles from a normalized score matrix.

For each dataset, all M*(M-1)/2 model pairs are compared, and every pair produces exactly one battle. A pair whose score difference is at or below tie_threshold produces a tie (outcome=0.5). Each battle carries a weight of 1 / n_battles_in_dataset, so that all datasets contribute equally to downstream fitting.

Parameters:

scores – Model × dataset normalized score matrix (higher = better).
tie_threshold – Score difference at or below which a pair is called a tie (outcome=0.5). None means ties occur only on exact equality.

Returns:

DataFrame with columns ["model_a", "model_b", "outcome", "dataset", "weight"]. outcome=1.0 when model_a wins, 0.0 when model_b wins, 0.5 for a tie.
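The pairing and weighting rule described above can be sketched in plain Python. This is an illustrative sketch, not the library's implementation: `sketch_battles` is a hypothetical helper, it takes a nested dict instead of a DataFrame, and it assumes every model has a score on every dataset.

```python
from itertools import combinations


def sketch_battles(scores, tie_threshold=None):
    """Build one battle per model pair per dataset.

    scores: {model: {dataset: normalized score}}, higher = better.
    Assumes a complete matrix (no missing scores).
    """
    models = list(scores)
    datasets = list(next(iter(scores.values())))
    # None means ties only on exact equality, i.e. a threshold of 0.
    thr = 0.0 if tie_threshold is None else tie_threshold
    pairs = list(combinations(models, 2))  # M*(M-1)/2 pairs
    battles = []
    for dataset in datasets:
        weight = 1.0 / len(pairs)  # 1 / n_battles_in_dataset
        for a, b in pairs:
            diff = scores[a][dataset] - scores[b][dataset]
            if abs(diff) <= thr:
                outcome = 0.5  # tie
            elif diff > 0:
                outcome = 1.0  # model_a wins
            else:
                outcome = 0.0  # model_b wins
            battles.append({"model_a": a, "model_b": b, "outcome": outcome,
                            "dataset": dataset, "weight": weight})
    return battles


example_scores = {
    "A": {"d1": 0.9, "d2": 0.5},
    "B": {"d1": 0.7, "d2": 0.5},
    "C": {"d1": 0.2, "d2": 0.8},
}
example_battles = sketch_battles(example_scores)
```

With three models and two datasets this yields six battles, each with weight 1/3, so the weights within a dataset sum to 1.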

evaluma.methods.elo.compute_winrate_matrix(scores: pandas.DataFrame, tie_threshold: float | None = None) → pandas.DataFrame

Compute M×M empirical win-rate matrix with equal dataset weighting.

Parameters:

scores – Model × dataset normalized score matrix.
tie_threshold – Passed to compute_battles().

Returns:

Square DataFrame (models × models). Cell (i,j) = fraction of datasets where model i beats model j. Diagonal is NaN. Rows sorted by descending average win-rate.
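As a rough illustration of the returned layout, the sketch below builds the same kind of matrix from a nested dict. `sketch_winrate` is a hypothetical helper. How the library counts ties is not stated here, so the sketch assumes a tie counts as a win for neither side; that is an assumption, not documented behavior.

```python
def sketch_winrate(scores):
    """Fraction of datasets on which row model beats column model.

    scores: {model: {dataset: normalized score}}, higher = better.
    Assumption: ties are not counted as wins for either model.
    """
    models = list(scores)
    datasets = list(next(iter(scores.values())))
    nan = float("nan")
    matrix = {i: {j: nan for j in models} for i in models}  # NaN diagonal
    for i in models:
        for j in models:
            if i == j:
                continue
            wins = sum(scores[i][d] > scores[j][d] for d in datasets)
            matrix[i][j] = wins / len(datasets)

    def average_winrate(i):
        off_diagonal = [matrix[i][j] for j in models if j != i]
        return sum(off_diagonal) / len(off_diagonal)

    # Rows sorted by descending average win-rate.
    order = sorted(models, key=average_winrate, reverse=True)
    return {i: matrix[i] for i in order}


winrates = sketch_winrate({
    "A": {"d1": 0.9, "d2": 0.5},
    "B": {"d1": 0.7, "d2": 0.5},
    "C": {"d1": 0.2, "d2": 0.8},
})
```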

evaluma.methods.elo._fit_elo(battles: pandas.DataFrame, models: list | None = None, scale: float = 400.0, base: float = 10.0, init_rating: float = 1000.0) → pandas.Series

Fit MLE ELO ratings from a battles DataFrame.

Uses logistic regression with per-battle sample weights. Falls back to iterative ELO (K=32) when logistic regression fails (e.g. total dominance).

Parameters:

battles – Output of compute_battles().
models – Explicit model list. When None, inferred from battles. Must be provided when battles may be empty.
scale – ELO scale parameter (default 400).
base – Logarithm base (default 10).
init_rating – Initial rating for iterative fallback (default 1000).

Returns:

ELO ratings indexed by model name.

Return type:

pd.Series
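For orientation, the standard Elo win-probability model that the scale and base parameters refer to can be written as below. This is the textbook formula; that `_fit_elo` maximizes the likelihood of exactly this model is an assumption based on its description, and `win_probability` is a hypothetical helper name.

```python
def win_probability(rating_a, rating_b, scale=400.0, base=10.0):
    """Standard Elo probability that model A beats model B."""
    return 1.0 / (1.0 + base ** ((rating_b - rating_a) / scale))


# Equal ratings give an even chance; a gap of one `scale` (400 points)
# gives odds of `base` to 1 (10:1) in favor of the stronger model.
even = win_probability(1000.0, 1000.0)
favored = win_probability(1400.0, 1000.0)
```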

evaluma.methods.elo._iterative_elo(battles: pandas.DataFrame, model_idx: dict, n: int, scale: float, base: float, init_rating: float, K: float = 32.0) → numpy.ndarray

Iterative ELO update rule fallback.
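The classic online Elo update that a fallback of this kind typically applies per battle is sketched below. It is a minimal sketch of the textbook rule, assuming the same scale and base conventions; `elo_update` is a hypothetical helper, and how the library applies per-battle weights is not modeled here.

```python
def elo_update(rating_a, rating_b, outcome, K=32.0, scale=400.0, base=10.0):
    """Apply one Elo update; outcome is 1.0, 0.5, or 0.0 from A's side."""
    expected_a = 1.0 / (1.0 + base ** ((rating_b - rating_a) / scale))
    delta = K * (outcome - expected_a)
    # The update is zero-sum: what A gains, B loses.
    return rating_a + delta, rating_b - delta


after_win = elo_update(1000.0, 1000.0, 1.0)
after_tie = elo_update(1000.0, 1000.0, 0.5)
```

Starting both models at 1000, a win moves the winner up by K/2 = 16 points and the loser down by the same amount, while a tie between equally rated models changes nothing.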

evaluma.methods.elo.compute_battles_from_runs(raw_runs: pandas.DataFrame, tie_threshold: float | None = None, metric_direction: dict[str, str] | None = None, norm_bounds: tuple | None = None) → pandas.DataFrame

Generate battles from per-seed long-format data.

Pairs all models within each (dataset, seed) combination. Each dataset contributes total weight 1 regardless of seed count.

Per-seed scores are made “higher = better” before comparison. When norm_bounds is supplied, scores are min-max normalized to [0, 1] per dataset (so tie_threshold is on the normalized scale, matching the win-rate matrix); otherwise "min" datasets are simply negated and raw scores are compared (TabArena convention).

Parameters:

raw_runs – Long-format DataFrame with columns ["model", "dataset", "seed", "score"].
tie_threshold – Score difference at or below which a pair is a tie.
metric_direction – Maps dataset names to "min" or "max".
norm_bounds – Optional (low, high) per-dataset bound Series. When given, each seed score is normalized via the same min-max rule as the score matrix before thresholding; "min" datasets are inverted by the normalization itself.

Returns:

DataFrame with columns ["model_a", "model_b", "outcome", "dataset", "seed", "weight"].
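The per-seed pairing without norm_bounds can be sketched as follows. This is an illustrative sketch under stated assumptions, not the library's code: `sketch_battles_from_runs` is a hypothetical helper operating on tuples, datasets missing from metric_direction are assumed to be "max", and a dataset's total weight of 1 is assumed to be split evenly over all of its battles.

```python
from collections import Counter, defaultdict
from itertools import combinations


def sketch_battles_from_runs(rows, metric_direction=None, tie_threshold=None):
    """rows: iterable of (model, dataset, seed, score) tuples."""
    direction = metric_direction or {}
    thr = 0.0 if tie_threshold is None else tie_threshold
    groups = defaultdict(dict)  # (dataset, seed) -> {model: score}
    for model, dataset, seed, score in rows:
        # Negate "min" datasets so that higher is always better.
        sign = -1.0 if direction.get(dataset, "max") == "min" else 1.0
        groups[(dataset, seed)][model] = sign * score
    battles = []
    for (dataset, seed), by_model in groups.items():
        for a, b in combinations(sorted(by_model), 2):
            diff = by_model[a] - by_model[b]
            outcome = 0.5 if abs(diff) <= thr else (1.0 if diff > 0 else 0.0)
            battles.append({"model_a": a, "model_b": b, "outcome": outcome,
                            "dataset": dataset, "seed": seed})
    # Each dataset contributes total weight 1 regardless of seed count.
    per_dataset = Counter(b["dataset"] for b in battles)
    for b in battles:
        b["weight"] = 1.0 / per_dataset[b["dataset"]]
    return battles


runs = [
    ("A", "rmse_task", 0, 0.3), ("B", "rmse_task", 0, 0.5),
    ("A", "rmse_task", 1, 0.6), ("B", "rmse_task", 1, 0.4),
    ("A", "acc_task", 0, 0.8), ("B", "acc_task", 0, 0.9),
]
run_battles = sketch_battles_from_runs(runs, {"rmse_task": "min", "acc_task": "max"})
```

Here the two-seed dataset yields two battles of weight 0.5 each and the single-seed dataset yields one battle of weight 1.0.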

evaluma.methods.elo.compute_elo(scores: pandas.DataFrame, n_bootstrap: int = 1000, random_state=None, tie_threshold: float | None = None, calibration_model: str | None = None, raw_runs: pandas.DataFrame | None = None, metric_direction: dict[str, str] | None = None, norm_bounds: tuple | None = None) → tuple[pandas.DataFrame, pandas.DataFrame]

Compute MLE ELO ratings with battle-within-task bootstrap CIs.

Parameters:

scores – Model × dataset normalized score matrix (higher = better).
n_bootstrap – Number of bootstrap replicates for 95% CI. Set to 0 to skip bootstrap (CI columns will be NaN).
random_state – Seed for numpy.random.default_rng().
tie_threshold – Score difference at or below which a pair is called a tie (outcome=0.5), as in compute_battles(). None means ties occur only on exact equality.
calibration_model – If given, shift all ratings so this model has ELO = 1000.
raw_runs – Long-format per-seed DataFrame with columns ["model", "dataset", "seed", "score"]. When provided, battles are generated per (dataset, seed) via compute_battles_from_runs() and bootstrap resamples within (dataset, seed) groups.
metric_direction – Passed to compute_battles_from_runs(). Ignored when raw_runs is None.
norm_bounds – Optional (low, high) per-dataset bound Series. When provided alongside raw_runs, per-seed scores are normalized to [0, 1] before battles are formed, so tie_threshold matches the normalized scale of the win-rate matrix. None keeps raw seed scores (TabArena convention). Ignored when raw_runs is None.

Returns:

Tuple of:

DataFrame with columns ["model", "ELO", "CI_low", "CI_high"] sorted descending by ELO.
M×M win-rate matrix as returned by compute_winrate_matrix().

Return type:

tuple[pandas.DataFrame, pandas.DataFrame]

Raises:

ValueError – If calibration_model is not in scores.index.
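Two of the steps described for compute_elo, resampling battles within each task and shifting ratings to a calibration model, can be sketched in isolation. This is a minimal sketch under stated assumptions: both helpers are hypothetical, `random.Random` stands in for numpy.random.default_rng(), and grouping is by dataset only (with raw_runs the groups would be (dataset, seed)).

```python
import random
from collections import defaultdict


def resample_within_groups(battles, rng):
    """Draw one bootstrap replicate, resampling battles inside each dataset."""
    groups = defaultdict(list)
    for battle in battles:
        groups[battle["dataset"]].append(battle)
    replicate = []
    for group in groups.values():
        # Sample with replacement, keeping each group's size fixed.
        replicate.extend(rng.choices(group, k=len(group)))
    return replicate


def calibrate(ratings, calibration_model, anchor=1000.0):
    """Shift all ratings so that calibration_model sits at `anchor`."""
    if calibration_model not in ratings:
        raise ValueError(f"{calibration_model!r} is not a known model")
    shift = anchor - ratings[calibration_model]
    return {model: rating + shift for model, rating in ratings.items()}


toy_battles = [
    {"dataset": "d1", "outcome": 1.0},
    {"dataset": "d1", "outcome": 0.0},
    {"dataset": "d2", "outcome": 0.5},
]
replicate = resample_within_groups(toy_battles, random.Random(0))
shifted = calibrate({"A": 1100.0, "B": 950.0}, "B")
```

In a full bootstrap, ratings would be refit on each replicate and the 2.5th and 97.5th percentiles of each model's ratings taken as CI_low and CI_high.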