evaluma.methods.elo#

Attributes#

Functions#

compute_battles(→ pandas.DataFrame)

Generate pairwise battles from a normalized score matrix.

compute_winrate_matrix(→ pandas.DataFrame)

Compute M×M empirical win-rate matrix with equal dataset weighting.

_fit_elo(→ pandas.Series)

Fit MLE ELO ratings from a battles DataFrame.

_iterative_elo(→ numpy.ndarray)

Iterative ELO update rule fallback.

compute_battles_from_runs(→ pandas.DataFrame)

Generate battles from per-seed long-format data.

compute_elo(→ tuple[pandas.DataFrame, pandas.DataFrame])

Compute MLE ELO ratings with battle-within-task bootstrap CIs.

Module Contents#

evaluma.methods.elo.logger#
evaluma.methods.elo.compute_battles(scores: pandas.DataFrame, tie_threshold: float | None = None) pandas.DataFrame#

Generate pairwise battles from a normalized score matrix.

For each dataset, all M*(M-1)/2 model pairs are compared, and every pair produces exactly one battle. A pair whose score difference is at or below tie_threshold produces a tie (outcome=0.5). Each battle carries a weight of 1 / n_battles_in_dataset, so that all datasets contribute equally to downstream fitting.

Parameters:
  • scores – Model × dataset normalized score matrix (higher = better).

  • tie_threshold – Score difference at or below which a pair is called a tie (outcome=0.5). None means ties occur only on exact equality.

Returns:

DataFrame with columns ["model_a", "model_b", "outcome", "dataset", "weight"]. outcome=1.0 when model_a wins, 0.0 when model_b wins, 0.5 for a tie.
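The pairing rule described above can be illustrated with a minimal, dependency-free sketch. It is not the library's implementation: `sketch_battles` is a hypothetical name, and the scores are held in a plain dict keyed by dataset (`{dataset: {model: score}}`) instead of a pandas DataFrame.

```python
from itertools import combinations

def sketch_battles(scores, tie_threshold=None):
    """scores: {dataset: {model: normalized score}}, higher = better (assumed layout)."""
    # None means ties only on exact equality, i.e. a threshold of zero.
    thr = 0.0 if tie_threshold is None else tie_threshold
    battles = []
    for dataset, col in scores.items():
        pairs = list(combinations(sorted(col), 2))  # all M*(M-1)/2 pairs
        for a, b in pairs:
            diff = col[a] - col[b]
            if abs(diff) <= thr:
                outcome = 0.5                       # tie
            else:
                outcome = 1.0 if diff > 0 else 0.0  # model_a wins / model_b wins
            battles.append({
                "model_a": a, "model_b": b, "outcome": outcome,
                "dataset": dataset,
                "weight": 1.0 / len(pairs),         # each dataset sums to weight 1
            })
    return battles

scores = {
    "d1": {"A": 0.9, "B": 0.5, "C": 0.5},
    "d2": {"A": 0.2, "B": 0.8, "C": 0.75},
}
battles = sketch_battles(scores, tie_threshold=0.1)
```

With three models, each dataset yields three battles of weight 1/3, so both datasets contribute the same total weight.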

evaluma.methods.elo.compute_winrate_matrix(scores: pandas.DataFrame, tie_threshold: float | None = None) pandas.DataFrame#

Compute M×M empirical win-rate matrix with equal dataset weighting.

Parameters:
  • scores – Model × dataset normalized score matrix.

  • tie_threshold – Passed to compute_battles().

Returns:

Square DataFrame (models × models). Cell (i,j) = fraction of datasets where model i beats model j. Diagonal is NaN. Rows sorted by descending average win-rate.
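As a rough illustration of the returned matrix, the sketch below counts, for each ordered pair, the fraction of datasets in which model i beats model j. `sketch_winrate` is a hypothetical helper, the input is a plain dict rather than a DataFrame, and every model is assumed to have a score on every dataset. How the library credits ties is not stated here; this sketch simply credits neither side.

```python
from itertools import combinations

def sketch_winrate(scores, tie_threshold=None):
    """scores: {dataset: {model: normalized score}} (assumed layout)."""
    thr = 0.0 if tie_threshold is None else tie_threshold
    models = sorted({m for col in scores.values() for m in col})
    wins = {i: {j: 0 for j in models if j != i} for i in models}
    for col in scores.values():
        for a, b in combinations(models, 2):
            diff = col[a] - col[b]
            if diff > thr:
                wins[a][b] += 1
            elif -diff > thr:
                wins[b][a] += 1
            # Ties credit neither side in this sketch (an assumption).
    n = len(scores)  # every dataset counts once: equal dataset weighting
    matrix = {
        i: {j: float("nan") if i == j else wins[i][j] / n for j in models}
        for i in models
    }
    # Sort rows by descending average win-rate over the off-diagonal cells.
    avg = {i: sum(wins[i].values()) / (n * (len(models) - 1)) for i in models}
    return {i: matrix[i] for i in sorted(models, key=lambda m: -avg[m])}

scores = {
    "d1": {"A": 0.2, "B": 0.9, "C": 0.5},
    "d2": {"A": 0.1, "B": 0.8, "C": 0.75},
}
winrate = sketch_winrate(scores, tie_threshold=0.1)
```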

evaluma.methods.elo._fit_elo(battles: pandas.DataFrame, models: list | None = None, scale: float = 400.0, base: float = 10.0, init_rating: float = 1000.0) pandas.Series#

Fit MLE ELO ratings from a battles DataFrame.

Uses logistic regression with per-battle sample weights. Falls back to iterative ELO (K=32) when logistic regression fails (e.g. total dominance).

Parameters:
  • battles – Output of compute_battles().

  • models – Explicit model list. When None, inferred from battles. Must be provided when battles may be empty.

  • scale – ELO scale parameter (default 400).

  • base – Logarithm base (default 10).

  • init_rating – Initial rating for iterative fallback (default 1000).

Returns:

ELO ratings indexed by model name.

Return type:

pandas.Series
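To make the maximum-likelihood idea concrete, here is a minimal sketch of a weighted Bradley-Terry fit expressed on the ELO scale. It swaps in plain gradient ascent for the logistic-regression solver, so it should not be read as the library's code. `sketch_fit_elo` and `n_iter` are hypothetical, and centering the ratings on `init_rating` is an assumption about how the scale is anchored.

```python
import math

def sketch_fit_elo(battles, scale=400.0, base=10.0, init_rating=1000.0, n_iter=2000):
    """battles: list of dicts with model_a, model_b, outcome, weight."""
    models = sorted({m for b in battles for m in (b["model_a"], b["model_b"])})
    theta = {m: 0.0 for m in models}  # strengths in natural-log units
    total_w = sum(b["weight"] for b in battles)
    for _ in range(n_iter):
        grad = {m: 0.0 for m in models}
        for b in battles:
            a, c = b["model_a"], b["model_b"]
            p_a = 1.0 / (1.0 + math.exp(theta[c] - theta[a]))  # P(model_a wins)
            g = b["weight"] * (b["outcome"] - p_a)  # weighted log-likelihood gradient
            grad[a] += g
            grad[c] -= g
        for m in models:
            theta[m] += grad[m] / total_w  # step size 1 / total weight
    # Convert to the ELO scale; the mean strength stays at 0, so the
    # mean rating equals init_rating.
    to_elo = scale / math.log(base)
    return {m: init_rating + to_elo * theta[m] for m in models}

# A wins 3 of 4 equally weighted battles against B.
battles = [
    {"model_a": "A", "model_b": "B", "outcome": o, "weight": 0.25}
    for o in (1.0, 1.0, 1.0, 0.0)
]
ratings = sketch_fit_elo(battles)
```

A 75% win rate corresponds to a rating gap of 400·log10(3), roughly 191 points. When one model wins every battle, the likelihood has no finite maximizer and the strengths grow without bound, which is the kind of failure the iterative fallback is described as covering.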

evaluma.methods.elo._iterative_elo(battles: pandas.DataFrame, model_idx: dict, n: int, scale: float, base: float, init_rating: float, K: float = 32.0) numpy.ndarray#

Iterative ELO update rule fallback.
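For reference, the classic iterative ELO update can be sketched as follows. This is the textbook rule, not necessarily what `_iterative_elo` does internally: the sketch takes a list of dicts and a list of model names instead of the `model_idx` / `n` arguments, processes battles in the order given, and ignores battle weights (both are assumptions).

```python
def sketch_iterative_elo(battles, models, scale=400.0, base=10.0,
                         init_rating=1000.0, K=32.0):
    """Single pass of the standard ELO update over the battles."""
    ratings = {m: init_rating for m in models}
    for b in battles:
        a, c = b["model_a"], b["model_b"]
        # Expected score of model_a under the ELO logistic curve.
        expected_a = 1.0 / (1.0 + base ** ((ratings[c] - ratings[a]) / scale))
        delta = K * (b["outcome"] - expected_a)
        ratings[a] += delta  # zero-sum update: what A gains, B loses
        ratings[c] -= delta
    return ratings

one_win = sketch_iterative_elo(
    [{"model_a": "A", "model_b": "B", "outcome": 1.0}], ["A", "B"]
)
# Equal starting ratings give an expected score of 0.5, so the winner gains K/2 = 16.
```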

evaluma.methods.elo.compute_battles_from_runs(raw_runs: pandas.DataFrame, tie_threshold: float | None = None, metric_direction: dict[str, str] | None = None, norm_bounds: tuple | None = None) pandas.DataFrame#

Generate battles from per-seed long-format data.

Pairs all models within each (dataset, seed) combination. Each dataset contributes total weight 1 regardless of seed count.

Per-seed scores are made “higher = better” before comparison. When norm_bounds is supplied, scores are min-max normalized to [0, 1] per dataset (so tie_threshold is on the normalized scale, matching the win-rate matrix); otherwise "min" datasets are simply negated and raw scores are compared (TabArena convention).

Parameters:
  • raw_runs – Long-format DataFrame with columns ["model", "dataset", "seed", "score"].

  • tie_threshold – Score difference at or below which a pair is a tie.

  • metric_direction – Maps dataset names to "min" or "max".

  • norm_bounds – Optional (low, high) per-dataset bound Series. When given, each seed score is normalized via the same min-max rule as the score matrix before thresholding; "min" datasets are inverted by the normalization itself.

Returns:

DataFrame with columns ["model_a", "model_b", "outcome", "dataset", "seed", "weight"].
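The per-seed pairing and weighting can be sketched as below. Several details are assumptions made for illustration: `sketch_battles_from_runs` is a hypothetical name, `raw_runs` is a list of `(model, dataset, seed, score)` tuples rather than a DataFrame, `norm_bounds` is a pair of dicts, datasets missing from `metric_direction` default to "max", and "min" datasets are inverted as `1 - x` after min-max scaling.

```python
from collections import Counter, defaultdict
from itertools import combinations

def sketch_battles_from_runs(raw_runs, tie_threshold=None,
                             metric_direction=None, norm_bounds=None):
    thr = 0.0 if tie_threshold is None else tie_threshold
    direction = metric_direction or {}
    groups = defaultdict(dict)  # (dataset, seed) -> {model: oriented score}
    for model, dataset, seed, score in raw_runs:
        is_min = direction.get(dataset, "max") == "min"
        if norm_bounds is not None:
            low, high = norm_bounds[0][dataset], norm_bounds[1][dataset]
            score = (score - low) / (high - low)  # min-max to [0, 1]
            if is_min:
                score = 1.0 - score               # assumed inversion rule
        elif is_min:
            score = -score                        # raw scores, simply negated
        groups[(dataset, seed)][model] = score

    battles = []
    for (dataset, seed), col in groups.items():
        for a, b in combinations(sorted(col), 2):
            diff = col[a] - col[b]
            outcome = 0.5 if abs(diff) <= thr else (1.0 if diff > 0 else 0.0)
            battles.append({"model_a": a, "model_b": b, "outcome": outcome,
                            "dataset": dataset, "seed": seed})
    # Spread weight 1 over all battles of a dataset, across its seeds.
    per_dataset = Counter(b["dataset"] for b in battles)
    for b in battles:
        b["weight"] = 1.0 / per_dataset[b["dataset"]]
    return battles

raw_runs = [
    ("A", "d1", 0, 0.9), ("B", "d1", 0, 0.7),
    ("A", "d1", 1, 0.6), ("B", "d1", 1, 0.8),
    ("A", "d2", 0, 0.3), ("B", "d2", 0, 0.5),  # d2 is a "min" metric
]
run_battles = sketch_battles_from_runs(
    raw_runs, metric_direction={"d1": "max", "d2": "min"}
)
```

Here d1 has two seeds, so each of its battles gets weight 0.5, while the single d2 battle gets weight 1.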

evaluma.methods.elo.compute_elo(scores: pandas.DataFrame, n_bootstrap: int = 1000, random_state=None, tie_threshold: float | None = None, calibration_model: str | None = None, raw_runs: pandas.DataFrame | None = None, metric_direction: dict[str, str] | None = None, norm_bounds: tuple | None = None) tuple[pandas.DataFrame, pandas.DataFrame]#

Compute MLE ELO ratings with battle-within-task bootstrap CIs.

Parameters:
  • scores – Model × dataset normalized score matrix (higher = better).

  • n_bootstrap – Number of bootstrap replicates for 95% CI. Set to 0 to skip bootstrap (CI columns will be NaN).

  • random_state – Seed for numpy.random.default_rng().

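  • tie_threshold – Score difference at or below which a pair is called a tie (outcome=0.5); a non-tie battle requires a strictly larger difference. None means ties occur only on exact equality.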

  • calibration_model – If given, shift all ratings so this model has ELO = 1000.

  • raw_runs – Long-format per-seed DataFrame with columns ["model", "dataset", "seed", "score"]. When provided, battles are generated per (dataset, seed) via compute_battles_from_runs() and bootstrap resamples within (dataset, seed) groups.

  • metric_direction – Passed to compute_battles_from_runs(). Ignored when raw_runs is None.

  • norm_bounds – Optional (low, high) per-dataset bound Series. When provided alongside raw_runs, per-seed scores are normalized to [0, 1] before battles are formed, so tie_threshold matches the normalized scale of the win-rate matrix. None keeps raw seed scores (TabArena convention). Ignored when raw_runs is None.

Returns:

Tuple of:

  • DataFrame with columns ["model", "ELO", "CI_low", "CI_high"] sorted descending by ELO.

  • M×M win-rate matrix as returned by compute_winrate_matrix().

Return type:

tuple[pandas.DataFrame, pandas.DataFrame]

Raises:

ValueError – If calibration_model is not in scores.index.
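The bootstrap and calibration steps can be sketched as follows, under several stated assumptions: battles are resampled with replacement within each dataset group, the 95% interval is taken from the 2.5th and 97.5th percentiles of the refitted ratings, and the standard-library `random` module stands in for numpy.random.default_rng(). `tiny_fit`, `percentile`, `sketch_bootstrap_ci`, and `calibrate` are hypothetical helpers; `tiny_fit` is a stand-in for the actual rating fit.

```python
import random
from collections import defaultdict

def tiny_fit(battles, K=32.0, scale=400.0, base=10.0, init_rating=1000.0):
    """Stand-in rating fit (one iterative ELO pass), for illustration only."""
    ratings = defaultdict(lambda: init_rating)
    for b in battles:
        a, c = b["model_a"], b["model_b"]
        expected_a = 1.0 / (1.0 + base ** ((ratings[c] - ratings[a]) / scale))
        delta = K * (b["outcome"] - expected_a)
        ratings[a] += delta
        ratings[c] -= delta
    return dict(ratings)

def percentile(sorted_vals, q):
    """Nearest-rank percentile of an already sorted list."""
    return sorted_vals[round(q * (len(sorted_vals) - 1))]

def sketch_bootstrap_ci(battles, fit, n_bootstrap=1000, random_state=None):
    rng = random.Random(random_state)
    groups = defaultdict(list)
    for b in battles:
        groups[b["dataset"]].append(b)  # resampling stays within each task
    samples = defaultdict(list)
    for _ in range(n_bootstrap):
        resampled = []
        for group in groups.values():
            resampled.extend(rng.choices(group, k=len(group)))
        for model, rating in fit(resampled).items():
            samples[model].append(rating)
    return {m: (percentile(sorted(v), 0.025), percentile(sorted(v), 0.975))
            for m, v in samples.items()}

def calibrate(ratings, calibration_model, target=1000.0):
    """Shift all ratings so that calibration_model sits at the target."""
    if calibration_model not in ratings:
        raise ValueError(f"{calibration_model!r} is not a known model")
    shift = target - ratings[calibration_model]
    return {m: r + shift for m, r in ratings.items()}

battles = (
    [{"model_a": "A", "model_b": "B", "outcome": o, "dataset": "d1"}
     for o in (1.0, 1.0, 1.0, 0.0)]
    + [{"model_a": "A", "model_b": "B", "outcome": o, "dataset": "d2"}
       for o in (1.0, 1.0, 0.5)]
)
ci = sketch_bootstrap_ci(battles, tiny_fit, n_bootstrap=200, random_state=0)
calibrated = calibrate(tiny_fit(battles), "B")
```

Because calibration is a constant shift, it changes neither the rating differences between models nor the width of the intervals.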