evaluma.methods.elo
Attributes
Functions
compute_battles() | Generate pairwise battles from a normalized score matrix.
compute_winrate_matrix() | Compute M×M empirical win-rate matrix with equal dataset weighting.
_fit_elo() | Fit MLE ELO ratings from a battles DataFrame.
_iterative_elo() | Iterative ELO update rule fallback.
compute_battles_from_runs() | Generate battles from per-seed long-format data.
compute_elo() | Compute MLE ELO ratings with battle-within-task bootstrap CIs.
Module Contents
- evaluma.methods.elo.logger
- evaluma.methods.elo.compute_battles(scores: pandas.DataFrame, tie_threshold: float | None = None) → pandas.DataFrame
Generate pairwise battles from a normalized score matrix.
For each dataset, all M*(M-1)/2 model pairs are compared, and every pair produces exactly one battle. Pairs whose scores differ by at most tie_threshold produce a tie (outcome=0.5). Each battle carries a weight of 1 / n_battles_in_dataset, so all datasets contribute equally to downstream fitting.
- Parameters:
scores – Model × dataset normalized score matrix (higher = better).
tie_threshold – Score difference at or below which a pair is called a tie (outcome=0.5). None means ties occur only on exact equality.
- Returns:
DataFrame with columns ["model_a", "model_b", "outcome", "dataset", "weight"]. outcome=1.0 when model_a wins, 0.0 when model_b wins, 0.5 for a tie.
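The pairing and weighting rules described above can be illustrated with a dependency-free sketch. This is not the library's implementation: sketch_battles is a hypothetical helper, scores are held in a nested dict instead of a DataFrame, and the choice of which model becomes model_a (the earlier one in row order) is an assumption.

```python
from itertools import combinations


def sketch_battles(scores, tie_threshold=None):
    """Hypothetical stand-in for the documented behaviour, on plain dicts.

    scores: {model: {dataset: normalized score}}, higher = better.
    """
    models = list(scores)
    datasets = list(next(iter(scores.values())))
    # None is treated as "tie only on exact equality".
    threshold = 0.0 if tie_threshold is None else tie_threshold
    battles = []
    for dataset in datasets:
        pairs = list(combinations(models, 2))  # M*(M-1)/2 pairs
        weight = 1.0 / len(pairs)  # weights within one dataset sum to 1
        for a, b in pairs:
            diff = scores[a][dataset] - scores[b][dataset]
            if abs(diff) <= threshold:
                outcome = 0.5
            else:
                outcome = 1.0 if diff > 0 else 0.0
            battles.append(
                {"model_a": a, "model_b": b, "outcome": outcome,
                 "dataset": dataset, "weight": weight}
            )
    return battles


example_scores = {
    "A": {"d1": 0.90, "d2": 0.50},
    "B": {"d1": 0.70, "d2": 0.50},
    "C": {"d1": 0.10, "d2": 0.80},
}
example_battles = sketch_battles(example_scores, tie_threshold=0.05)
```

With three models and two datasets this yields six battles, each weighted 1/3, and the A versus B comparison on d2 is recorded as a tie.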
- evaluma.methods.elo.compute_winrate_matrix(scores: pandas.DataFrame, tie_threshold: float | None = None) → pandas.DataFrame
Compute M×M empirical win-rate matrix with equal dataset weighting.
- Parameters:
scores – Model × dataset normalized score matrix.
tie_threshold – Passed to compute_battles().
- Returns:
Square DataFrame (models × models). Cell (i,j) = fraction of datasets where model i beats model j. Diagonal is NaN. Rows sorted by descending average win-rate.
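As a rough illustration of the returned matrix, the sketch below counts, for each ordered pair, the share of datasets on which the row model beats the column model. The helper name sketch_winrate is hypothetical, and treating a tie as a win for neither side is an assumption; the source does not say how ties enter the fraction.

```python
def sketch_winrate(scores, tie_threshold=None):
    """Hypothetical win-rate matrix on plain dicts: {row: {col: rate}}."""
    models = list(scores)
    datasets = list(next(iter(scores.values())))
    threshold = 0.0 if tie_threshold is None else tie_threshold
    nan = float("nan")
    matrix = {i: {j: nan for j in models} for i in models}  # NaN diagonal
    for i in models:
        for j in models:
            if i == j:
                continue
            # Assumption: only strict wins beyond the threshold are counted.
            wins = sum(1 for d in datasets if scores[i][d] - scores[j][d] > threshold)
            matrix[i][j] = wins / len(datasets)  # every dataset counts equally

    def average_rate(i):
        rates = [matrix[i][j] for j in models if j != i]
        return sum(rates) / len(rates)

    # Rows sorted by descending average win-rate.
    return {i: matrix[i] for i in sorted(models, key=average_rate, reverse=True)}


winrate_scores = {
    "A": {"d1": 0.90, "d2": 0.50},
    "B": {"d1": 0.70, "d2": 0.50},
    "C": {"d1": 0.10, "d2": 0.80},
}
winrates = sketch_winrate(winrate_scores, tie_threshold=0.05)
```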
- evaluma.methods.elo._fit_elo(battles: pandas.DataFrame, models: list | None = None, scale: float = 400.0, base: float = 10.0, init_rating: float = 1000.0) → pandas.Series
Fit MLE ELO ratings from a battles DataFrame.
Uses logistic regression with per-battle sample weights. Falls back to iterative ELO (K=32) when logistic regression fails (e.g. total dominance).
- Parameters:
battles – Output of compute_battles().
models – Explicit model list. When None, inferred from battles. Must be provided when battles may be empty.
scale – ELO scale parameter (default 400).
base – Logarithm base (default 10).
init_rating – Initial rating for iterative fallback (default 1000).
- Returns:
ELO ratings indexed by model name.
- Return type:
pd.Series
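To make the role of scale, base, and the per-battle weights concrete, here is a minimal sketch of a weighted maximum-likelihood fit under the standard Elo win model, P(a beats b) = 1 / (1 + base^((R_b - R_a) / scale)). It uses plain gradient ascent where the documented function uses logistic regression, and it centers ratings on init_rating; both choices, along with the helper name and the steps and lr parameters, are assumptions made for illustration.

```python
import math


def sketch_fit_elo(battles, models, scale=400.0, base=10.0,
                   init_rating=1000.0, steps=2000, lr=0.5):
    """Hypothetical weighted MLE fit by gradient ascent (not the library code)."""
    theta = {m: 0.0 for m in models}  # ratings in natural-log units
    for _ in range(steps):
        grad = {m: 0.0 for m in models}
        for b in battles:
            a_name, b_name = b["model_a"], b["model_b"]
            p_a = 1.0 / (1.0 + math.exp(theta[b_name] - theta[a_name]))
            # Ties (outcome=0.5) act as half a win and half a loss.
            g = b["weight"] * (b["outcome"] - p_a)
            grad[a_name] += g
            grad[b_name] -= g
        for m in models:
            theta[m] += lr * grad[m]
    mean = sum(theta.values()) / len(models)
    factor = scale / math.log(base)  # converts natural-log units to rating points
    return {m: init_rating + factor * (theta[m] - mean) for m in models}


# A wins three of four equally weighted battles against B.
toy_battles = [
    {"model_a": "A", "model_b": "B", "outcome": o, "weight": 0.25}
    for o in (1.0, 1.0, 1.0, 0.0)
]
toy_ratings = sketch_fit_elo(toy_battles, ["A", "B"])
```

In this toy case the fitted gap converges to 400 * log10(3), roughly 191 points, which is the rating difference at which the model predicts a 75% win probability. When one model wins every battle the likelihood has no finite maximum, which is the kind of situation the documented iterative fallback is meant to handle.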
- evaluma.methods.elo._iterative_elo(battles: pandas.DataFrame, model_idx: dict, n: int, scale: float, base: float, init_rating: float, K: float = 32.0) → numpy.ndarray
Iterative ELO update rule fallback.
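The source does not document the fallback beyond its signature, so the following is only a sketch of the classic Elo update rule that such a fallback presumably resembles. It assumes a single pass over the battles in order and ignores battle weights; the helper name and the dict-based inputs are hypothetical.

```python
def sketch_iterative_elo(battles, models, scale=400.0, base=10.0,
                         init_rating=1000.0, K=32.0):
    """Hypothetical single-pass Elo update (assumed behaviour, not the library code)."""
    ratings = {m: init_rating for m in models}
    for b in battles:
        r_a = ratings[b["model_a"]]
        r_b = ratings[b["model_b"]]
        # Expected score of model_a under the Elo win model.
        expected_a = 1.0 / (1.0 + base ** ((r_b - r_a) / scale))
        delta = K * (b["outcome"] - expected_a)
        ratings[b["model_a"]] = r_a + delta  # zero-sum update
        ratings[b["model_b"]] = r_b - delta
    return ratings


single_win = [{"model_a": "A", "model_b": "B", "outcome": 1.0}]
updated = sketch_iterative_elo(single_win, ["A", "B"])
```

Starting from equal ratings, one win moves the winner up by K/2 = 16 points and the loser down by the same amount.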
- evaluma.methods.elo.compute_battles_from_runs(raw_runs: pandas.DataFrame, tie_threshold: float | None = None, metric_direction: dict[str, str] | None = None, norm_bounds: tuple | None = None) → pandas.DataFrame
Generate battles from per-seed long-format data.
Pairs all models within each (dataset, seed) combination. Each dataset contributes total weight 1 regardless of seed count.
Per-seed scores are made “higher = better” before comparison. When norm_bounds is supplied, scores are min-max normalized to [0, 1] per dataset (so tie_threshold is on the normalized scale, matching the win-rate matrix); otherwise "min" datasets are simply negated and raw scores are compared (TabArena convention).
- Parameters:
raw_runs – Long-format DataFrame with columns ["model", "dataset", "seed", "score"].
tie_threshold – Score difference at or below which a pair is a tie.
metric_direction – Maps dataset names to "min" or "max".
norm_bounds – Optional (low, high) pair of per-dataset bound Series. When given, each seed score is normalized via the same min-max rule as the score matrix before thresholding; "min" datasets are inverted by the normalization itself.
- Returns:
DataFrame with columns ["model_a", "model_b", "outcome", "dataset", "seed", "weight"].
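The per-seed pairing, score orientation, and weighting can be sketched without pandas as follows. Several details are assumptions not confirmed by the source: the helper name, the default direction of "max" for unlisted datasets, the exact min-max formula (with "min" datasets mapped to 1 minus the normalized value), and the weight of 1 / (number of battles in the dataset).

```python
from collections import defaultdict
from itertools import combinations


def sketch_battles_from_runs(rows, tie_threshold=None,
                             metric_direction=None, norm_bounds=None):
    """Hypothetical per-seed battle generation; rows are dicts with
    keys model, dataset, seed, score."""
    directions = metric_direction or {}
    threshold = 0.0 if tie_threshold is None else tie_threshold
    groups = defaultdict(dict)  # (dataset, seed) -> {model: oriented score}
    for row in rows:
        dataset, score = row["dataset"], row["score"]
        minimize = directions.get(dataset, "max") == "min"
        if norm_bounds is not None:
            low, high = norm_bounds[0][dataset], norm_bounds[1][dataset]
            score = (score - low) / (high - low)  # assumed min-max rule
            if minimize:
                score = 1.0 - score  # inversion happens inside normalization
        elif minimize:
            score = -score  # raw scores, simply negated
        groups[(dataset, row["seed"])][row["model"]] = score

    battles = []
    per_dataset = defaultdict(int)
    for (dataset, seed), by_model in groups.items():
        for a, b in combinations(by_model, 2):
            diff = by_model[a] - by_model[b]
            if abs(diff) <= threshold:
                outcome = 0.5
            else:
                outcome = 1.0 if diff > 0 else 0.0
            battles.append({"model_a": a, "model_b": b, "outcome": outcome,
                            "dataset": dataset, "seed": seed})
            per_dataset[dataset] += 1
    for battle in battles:
        # Each dataset contributes total weight 1 regardless of seed count.
        battle["weight"] = 1.0 / per_dataset[battle["dataset"]]
    return battles


rows = [
    {"model": "A", "dataset": "rmse_task", "seed": 0, "score": 1.0},
    {"model": "B", "dataset": "rmse_task", "seed": 0, "score": 2.0},
    {"model": "A", "dataset": "rmse_task", "seed": 1, "score": 1.2},
    {"model": "B", "dataset": "rmse_task", "seed": 1, "score": 1.1},
]
direction = {"rmse_task": "min"}
raw_battles = sketch_battles_from_runs(rows, metric_direction=direction)
bounds = ({"rmse_task": 1.0}, {"rmse_task": 2.0})
normalized_battles = sketch_battles_from_runs(
    rows, tie_threshold=0.15, metric_direction=direction, norm_bounds=bounds
)
```

On raw scores the second seed is a narrow win for B; once scores are normalized and a threshold of 0.15 is applied, the same seed becomes a tie, which shows why the threshold only has a consistent meaning on the normalized scale.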
- evaluma.methods.elo.compute_elo(scores: pandas.DataFrame, n_bootstrap: int = 1000, random_state=None, tie_threshold: float | None = None, calibration_model: str | None = None, raw_runs: pandas.DataFrame | None = None, metric_direction: dict[str, str] | None = None, norm_bounds: tuple | None = None) → tuple[pandas.DataFrame, pandas.DataFrame]
Compute MLE ELO ratings with battle-within-task bootstrap CIs.
- Parameters:
scores – Model × dataset normalized score matrix (higher = better).
n_bootstrap – Number of bootstrap replicates for 95% CI. Set to 0 to skip bootstrap (CI columns will be NaN).
random_state – Seed for numpy.random.default_rng().
tie_threshold – Score difference at or below which a pair is recorded as a tie; larger differences produce a win/loss battle.
calibration_model – If given, shift all ratings so this model has ELO = 1000.
raw_runs – Long-format per-seed DataFrame with columns ["model", "dataset", "seed", "score"]. When provided, battles are generated per (dataset, seed) via compute_battles_from_runs() and bootstrap resamples within (dataset, seed) groups.
metric_direction – Passed to compute_battles_from_runs(). Ignored when raw_runs is None.
norm_bounds – Optional (low, high) pair of per-dataset bound Series. When provided alongside raw_runs, per-seed scores are normalized to [0, 1] before battles are formed, so tie_threshold matches the normalized scale of the win-rate matrix. None keeps raw seed scores (TabArena convention). Ignored when raw_runs is None.
- Returns:
A pair of DataFrames:
DataFrame with columns ["model", "ELO", "CI_low", "CI_high"] sorted descending by ELO.
M×M win-rate matrix as returned by compute_winrate_matrix().
- Return type:
tuple[pandas.DataFrame, pandas.DataFrame]
- Raises:
ValueError – If calibration_model is not in scores.index.
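The three mechanical steps around the fit (resampling battles within each task, reading off a 95% interval, and shifting ratings to a calibration model) can be sketched with the standard library alone. All three helper names are hypothetical, the nearest-rank percentile is a simplification of whatever percentile method the library uses, and the rating fit itself is left out.

```python
import random
from collections import defaultdict


def resample_within_groups(battles, rng, key="dataset"):
    """Draw a bootstrap sample that keeps each group's battle count fixed."""
    groups = defaultdict(list)
    for battle in battles:
        groups[battle[key]].append(battle)
    sample = []
    for members in groups.values():
        sample.extend(rng.choices(members, k=len(members)))  # with replacement
    return sample


def percentile_interval(values, level=0.95):
    """Crude nearest-rank percentile interval (assumed, for illustration)."""
    ordered = sorted(values)
    low_idx = int((1 - level) / 2 * (len(ordered) - 1))
    high_idx = int((1 + level) / 2 * (len(ordered) - 1))
    return ordered[low_idx], ordered[high_idx]


def calibrate(ratings, calibration_model, target=1000.0):
    """Shift all ratings so that calibration_model sits at the target rating."""
    if calibration_model not in ratings:
        raise ValueError(f"{calibration_model!r} is not a known model")
    shift = target - ratings[calibration_model]
    return {model: rating + shift for model, rating in ratings.items()}


rng = random.Random(0)
demo_battles = (
    [{"dataset": "d1", "outcome": 1.0} for _ in range(3)]
    + [{"dataset": "d2", "outcome": 0.0} for _ in range(5)]
)
demo_sample = resample_within_groups(demo_battles, rng)
demo_interval = percentile_interval(list(range(1000)))
demo_calibrated = calibrate({"A": 1100.0, "B": 950.0}, "B")
```

In a full pipeline each bootstrap sample would be refitted, and the per-model ratings collected across replicates would be passed to the interval step; shifting by a constant leaves rating differences, and therefore predicted win probabilities, unchanged.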