evaluma.methods.elo
===================

.. py:module:: evaluma.methods.elo


Attributes
----------

.. autoapisummary::

   evaluma.methods.elo.logger


Functions
---------

.. autoapisummary::

   evaluma.methods.elo.compute_battles
   evaluma.methods.elo.compute_winrate_matrix
   evaluma.methods.elo._fit_elo
   evaluma.methods.elo._iterative_elo
   evaluma.methods.elo.compute_battles_from_runs
   evaluma.methods.elo.compute_elo


Module Contents
---------------

.. py:data:: logger

.. py:function:: compute_battles(scores: pandas.DataFrame, tie_threshold: float | None = None) -> pandas.DataFrame

   Generate pairwise battles from a normalized score matrix.

   For each dataset, all M*(M-1)/2 model pairs are compared. Every pair
   produces exactly one battle. Pairs within ``tie_threshold`` produce a tie
   (outcome=0.5). Each battle carries a weight of ``1 / n_battles_in_dataset``
   so that all datasets contribute equally to downstream fitting.

   :param scores: Model × dataset normalized score matrix (higher = better).
   :param tie_threshold: Score difference at or below which a pair is called a
                         tie (outcome=0.5). ``None`` means ties occur only on exact equality.

   :returns: DataFrame with columns ``["model_a", "model_b", "outcome", "dataset",
             "weight"]``. ``outcome=1.0`` when model_a wins, ``0.0`` when model_b
             wins, ``0.5`` for a tie.
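
   The pairing, tie, and weighting rules above can be illustrated with a
   minimal sketch. It is not the module's implementation: it uses plain
   dictionaries instead of a DataFrame, and the name ``sketch_battles`` is
   hypothetical.

   .. code-block:: python

      from itertools import combinations

      def sketch_battles(scores, tie_threshold=None):
          """scores: {dataset: {model: score}}, higher = better (assumed layout)."""
          # None is treated as "tie only on exact equality".
          thr = 0.0 if tie_threshold is None else tie_threshold
          battles = []
          for dataset, by_model in scores.items():
              pairs = list(combinations(sorted(by_model), 2))  # M*(M-1)/2 pairs
              for a, b in pairs:
                  diff = by_model[a] - by_model[b]
                  if abs(diff) <= thr:
                      outcome = 0.5                    # tie
                  else:
                      outcome = 1.0 if diff > 0 else 0.0
                  battles.append({
                      "model_a": a, "model_b": b, "outcome": outcome,
                      "dataset": dataset,
                      "weight": 1.0 / len(pairs),      # each dataset sums to 1
                  })
          return battles

      battles = sketch_battles(
          {"d1": {"m1": 0.9, "m2": 0.5, "m3": 0.5},
           "d2": {"m1": 0.25, "m2": 0.75}},
          tie_threshold=0.125,
      )

   Here ``d1`` yields three battles of weight 1/3 each and ``d2`` a single
   battle of weight 1, so both datasets carry the same total weight.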


.. py:function:: compute_winrate_matrix(scores: pandas.DataFrame, tie_threshold: float | None = None) -> pandas.DataFrame

   Compute M×M empirical win-rate matrix with equal dataset weighting.

   :param scores: Model × dataset normalized score matrix.
   :param tie_threshold: Passed to :func:`compute_battles`.

   :returns: Square DataFrame (models × models). Cell (i,j) = fraction of datasets
             where model i beats model j. Diagonal is NaN. Rows sorted by
             descending average win-rate.
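
   As a rough illustration of the returned layout, the sketch below builds a
   nested dictionary in place of a DataFrame. It assumes every model has a
   score on every dataset and at least two models are present, and it counts
   strict wins only; how ties enter the real matrix is not stated here. The
   name ``sketch_winrate`` is hypothetical.

   .. code-block:: python

      def sketch_winrate(scores):
          """scores: {dataset: {model: score}}, higher = better (assumed layout)."""
          datasets = list(scores)
          models = sorted(scores[datasets[0]])
          rate = {i: {} for i in models}
          for i in models:
              for j in models:
                  if i == j:
                      rate[i][j] = float("nan")        # diagonal is undefined
                      continue
                  wins = sum(scores[d][i] > scores[d][j] for d in datasets)
                  rate[i][j] = wins / len(datasets)    # equal dataset weighting

          def average(i):
              # Mean of the off-diagonal cells in row i.
              return sum(v for j, v in rate[i].items() if j != i) / (len(models) - 1)

          ordered = sorted(models, key=average, reverse=True)
          return {i: rate[i] for i in ordered}

      matrix = sketch_winrate({
          "d1": {"a": 0.9, "b": 0.5, "c": 0.1},
          "d2": {"a": 0.2, "b": 0.8, "c": 0.1},
      })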


.. py:function:: _fit_elo(battles: pandas.DataFrame, models: list | None = None, scale: float = 400.0, base: float = 10.0, init_rating: float = 1000.0) -> pandas.Series

   Fit MLE ELO ratings from a battles DataFrame.

   Uses logistic regression with per-battle sample weights. Falls back to
   iterative ELO (K=32) when logistic regression fails (e.g. total dominance).

   :param battles: Output of :func:`compute_battles`.
   :param models: Explicit model list. When ``None``, inferred from battles.
                  Must be provided when battles may be empty.
   :param scale: ELO scale parameter (default 400).
   :param base: Logarithm base (default 10).
   :param init_rating: Initial rating for iterative fallback (default 1000).

   :returns: ELO ratings indexed by model name.
   :rtype: pandas.Series
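
   For orientation, the sketch below writes out the standard Elo win
   probability for a given ``scale`` and ``base``, together with a weighted
   negative log-likelihood over battle records. That the module minimizes
   exactly this objective is an assumption; the helper names and the
   dictionary-based battle records are hypothetical.

   .. code-block:: python

      import math

      def expected_score(r_a, r_b, scale=400.0, base=10.0):
          """Standard Elo probability that model_a beats model_b."""
          return 1.0 / (1.0 + base ** ((r_b - r_a) / scale))

      def weighted_nll(ratings, battles, scale=400.0, base=10.0):
          """Weighted negative log-likelihood of the battle outcomes."""
          total = 0.0
          for b in battles:
              p = expected_score(ratings[b["model_a"]], ratings[b["model_b"]],
                                 scale, base)
              y = b["outcome"]  # 1.0, 0.0, or 0.5 for a tie
              total -= b["weight"] * (y * math.log(p) + (1.0 - y) * math.log(1.0 - p))
          return total

      one_win = [{"model_a": "m1", "model_b": "m2", "outcome": 1.0, "weight": 1.0}]
      nll_equal = weighted_nll({"m1": 1000.0, "m2": 1000.0}, one_win)
      nll_gap = weighted_nll({"m1": 1100.0, "m2": 1000.0}, one_win)

   With a single win for ``m1``, widening the rating gap keeps lowering the
   loss, so no finite maximum-likelihood rating exists. This is the kind of
   total-dominance case in which a fallback is needed.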


.. py:function:: _iterative_elo(battles: pandas.DataFrame, model_idx: dict, n: int, scale: float, base: float, init_rating: float, K: float = 32.0) -> numpy.ndarray

   Iterative ELO update rule, used by :func:`_fit_elo` as a fallback when
   logistic regression fails.
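
   The classic sequential Elo update can be sketched as follows. This is an
   illustration under assumptions, not the module's code: it works on a list
   of battle dictionaries, ignores battle weights, and makes a single pass in
   the given order. The name ``sketch_iterative_elo`` is hypothetical.

   .. code-block:: python

      def sketch_iterative_elo(battles, models, scale=400.0, base=10.0,
                               init_rating=1000.0, K=32.0):
          ratings = {m: init_rating for m in models}
          for b in battles:
              r_a = ratings[b["model_a"]]
              r_b = ratings[b["model_b"]]
              # Expected score of model_a under the current ratings.
              expected_a = 1.0 / (1.0 + base ** ((r_b - r_a) / scale))
              delta = K * (b["outcome"] - expected_a)
              ratings[b["model_a"]] = r_a + delta   # winner gains ...
              ratings[b["model_b"]] = r_b - delta   # ... what the loser gives up
          return ratings

      ratings = sketch_iterative_elo(
          [{"model_a": "m1", "model_b": "m2", "outcome": 1.0}],
          models=["m1", "m2"],
      )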


.. py:function:: compute_battles_from_runs(raw_runs: pandas.DataFrame, tie_threshold: float | None = None, metric_direction: dict[str, str] | None = None, norm_bounds: tuple | None = None) -> pandas.DataFrame

   Generate battles from per-seed long-format data.

   Pairs all models within each (dataset, seed) combination. Each dataset
   contributes total weight 1 regardless of seed count.

   Per-seed scores are made "higher = better" before comparison. When
   ``norm_bounds`` is supplied, scores are min-max normalized to ``[0, 1]``
   per dataset (so ``tie_threshold`` is on the normalized scale, matching the
   win-rate matrix); otherwise ``"min"`` datasets are simply negated and raw
   scores are compared (TabArena convention).

   :param raw_runs: Long-format DataFrame with columns
                    ``["model", "dataset", "seed", "score"]``.
   :param tie_threshold: Score difference at or below which a pair is a tie.
   :param metric_direction: Maps dataset names to ``"min"`` or ``"max"``.
   :param norm_bounds: Optional ``(low, high)`` per-dataset bound Series. When
                       given, each seed score is normalized via the same min-max rule as
                       the score matrix before thresholding; ``"min"`` datasets are
                       inverted by the normalization itself.

   :returns: DataFrame with columns ``["model_a", "model_b", "outcome", "dataset",
             "seed", "weight"]``.

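   The grouping and weighting described above can be sketched without
   ``norm_bounds`` (the raw-score path). The tuple-based input, the default
   direction of ``"max"`` for unlisted datasets, and the name
   ``sketch_battles_from_runs`` are assumptions made for illustration.

   .. code-block:: python

      from collections import defaultdict
      from itertools import combinations

      def sketch_battles_from_runs(rows, tie_threshold=None, metric_direction=None):
          """rows: iterable of (model, dataset, seed, score) tuples."""
          direction = metric_direction or {}
          thr = 0.0 if tie_threshold is None else tie_threshold
          groups = defaultdict(dict)
          for model, dataset, seed, score in rows:
              if direction.get(dataset, "max") == "min":
                  score = -score                    # make higher = better
              groups[(dataset, seed)][model] = score

          battles = []
          for (dataset, seed), by_model in groups.items():
              for a, b in combinations(sorted(by_model), 2):
                  diff = by_model[a] - by_model[b]
                  outcome = 0.5 if abs(diff) <= thr else (1.0 if diff > 0 else 0.0)
                  battles.append({"model_a": a, "model_b": b, "outcome": outcome,
                                  "dataset": dataset, "seed": seed})

          # Spread a total weight of 1 over all battles of each dataset,
          # so extra seeds do not give a dataset extra influence.
          per_dataset = defaultdict(int)
          for b in battles:
              per_dataset[b["dataset"]] += 1
          for b in battles:
              b["weight"] = 1.0 / per_dataset[b["dataset"]]
          return battles

      run_battles = sketch_battles_from_runs(
          [("m1", "d1", 0, 0.1), ("m2", "d1", 0, 0.3),
           ("m1", "d1", 1, 0.2), ("m2", "d1", 1, 0.2),
           ("m1", "d2", 0, 0.7), ("m2", "d2", 0, 0.9)],
          metric_direction={"d1": "min", "d2": "max"},
      )

   Dataset ``d1`` has two seeds and therefore two battles of weight 0.5,
   while ``d2`` has one battle of weight 1.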

.. py:function:: compute_elo(scores: pandas.DataFrame, n_bootstrap: int = 1000, random_state=None, tie_threshold: float | None = None, calibration_model: str | None = None, raw_runs: pandas.DataFrame | None = None, metric_direction: dict[str, str] | None = None, norm_bounds: tuple | None = None) -> tuple[pandas.DataFrame, pandas.DataFrame]

   Compute MLE ELO ratings with battle-within-task bootstrap CIs.

   :param scores: Model × dataset normalized score matrix (higher = better).
   :param n_bootstrap: Number of bootstrap replicates used for the 95% CI. Set
                       to 0 to skip the bootstrap (CI columns will be NaN).
   :param random_state: Seed for :func:`numpy.random.default_rng`.
   :param tie_threshold: Score difference at or below which a pair is recorded
                         as a tie (see :func:`compute_battles`).
   :param calibration_model: If given, shift all ratings so this model has
                             ELO = 1000.
   :param raw_runs: Long-format per-seed DataFrame with columns
                    ``["model", "dataset", "seed", "score"]``. When provided, battles
                    are generated per (dataset, seed) via
                    :func:`compute_battles_from_runs` and bootstrap resamples within
                    (dataset, seed) groups.
   :param metric_direction: Passed to :func:`compute_battles_from_runs`. Ignored
                            when ``raw_runs`` is ``None``.
   :param norm_bounds: Optional ``(low, high)`` per-dataset bound Series. When
                       provided alongside ``raw_runs``, per-seed scores are normalized to
                       ``[0, 1]`` before battles are formed, so ``tie_threshold`` matches
                       the normalized scale of the win-rate matrix. ``None`` keeps raw
                       seed scores (TabArena convention). Ignored when ``raw_runs`` is
                       ``None``.

   :returns:

             - DataFrame with columns ``["model", "ELO", "CI_low", "CI_high"]``
               sorted descending by ELO.
             - M×M win-rate matrix as returned by :func:`compute_winrate_matrix`.
   :rtype: tuple[pandas.DataFrame, pandas.DataFrame]

   :raises ValueError: If ``calibration_model`` is not in ``scores.index``.
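
   Two of the steps named above, resampling battles within each task and
   shifting ratings to a calibration model, can be sketched as follows. This
   is a simplified illustration: it uses the standard-library ``random``
   module rather than NumPy, groups by dataset only, and the helper names are
   hypothetical.

   .. code-block:: python

      import random
      from collections import defaultdict

      def resample_within_datasets(battles, rng):
          """Draw one bootstrap replicate, resampling battles inside each dataset."""
          by_dataset = defaultdict(list)
          for b in battles:
              by_dataset[b["dataset"]].append(b)
          sample = []
          for group in by_dataset.values():
              # Sample with replacement, keeping each dataset's battle count fixed.
              sample.extend(rng.choices(group, k=len(group)))
          return sample

      def calibrate(ratings, calibration_model, anchor=1000.0):
          """Shift all ratings so that calibration_model sits at the anchor."""
          if calibration_model not in ratings:
              raise ValueError(f"unknown calibration model: {calibration_model!r}")
          shift = anchor - ratings[calibration_model]
          return {m: r + shift for m, r in ratings.items()}

      demo_battles = [
          {"model_a": "m1", "model_b": "m2", "outcome": 1.0, "dataset": "d1"},
          {"model_a": "m1", "model_b": "m3", "outcome": 0.0, "dataset": "d1"},
          {"model_a": "m1", "model_b": "m2", "outcome": 0.5, "dataset": "d2"},
      ]
      replicate = resample_within_datasets(demo_battles, random.Random(0))
      shifted = calibrate({"m1": 1100.0, "m2": 950.0}, "m2")

   Refitting ratings on many such replicates and taking the 2.5th and 97.5th
   percentiles per model is one common way to obtain a 95% interval.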


