evaluma.benchmark
=================

.. py:module:: evaluma.benchmark


Classes
-------

.. autoapisummary::

   evaluma.benchmark.Benchmark


Module Contents
---------------

.. py:class:: Benchmark(raw_matrix: pandas.DataFrame, *, norm_ref_low=None, norm_ref_high=None, metric_direction=None, raw_runs=None)

   Container for a normalized model-vs-dataset score matrix.

   After construction the normalized scores are available as ``scores_``.
   Use the analysis methods to compute rankings, comparisons, and profiles.


   .. py:attribute:: _raw


   .. py:attribute:: _norm_ref_low
      :value: None


   .. py:attribute:: _norm_ref_high
      :value: None


   .. py:attribute:: _metric_direction
      :value: None


   .. py:attribute:: _raw_runs
      :value: None


   .. py:method:: _normalize(matrix)


   .. py:property:: scores_

      Normalized model × dataset score matrix.


   .. py:method:: _new(raw_matrix, raw_runs=None)


   .. py:method:: select_models(models)

      Subset the benchmark to the given models.

      :param models: List of model names to retain.

      :returns: New benchmark containing only the selected models.
      :rtype: Benchmark


   .. py:method:: select_datasets(datasets)

      Subset the benchmark to the given datasets.

      :param datasets: List of dataset names to retain.

      :returns: New benchmark containing only the selected datasets.
      :rtype: Benchmark


   .. py:method:: drop_incomplete()

      Remove models that have missing scores for any dataset.

      :returns: New benchmark with incomplete models removed.
      :rtype: Benchmark
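
      The filtering described above can be sketched on a plain
      ``pandas.DataFrame``, assuming models are rows and datasets are
      columns. The frame contents and the use of ``dropna`` are illustrative
      assumptions, not evaluma's internal code.

      ```python
      import pandas as pd

      # Hypothetical raw matrix: rows are models, columns are datasets.
      raw = pd.DataFrame(
          {"ds1": [0.81, 0.77, 0.90], "ds2": [0.64, None, 0.71]},
          index=["model_a", "model_b", "model_c"],
      )

      # Keep only the models (rows) that have a score on every dataset.
      complete = raw.dropna(axis=0, how="any")
      print(list(complete.index))
      ```

      Here ``model_b`` is dropped because it has no score on ``ds2``.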


   .. py:method:: iqm_ranking(n_bootstrap=1000, random_state=None)

      Compute interquartile mean (IQM) rankings with stratified bootstrap
      confidence intervals.

      Implements the IQM of Agarwal et al. (2021), as used in rliable, on the
      flat run × dataset score array. Requires scores from multiple seeds;
      use ``aggregate_ranking()`` for single-run data.

      :param n_bootstrap: Number of bootstrap resamples used for the 95% CI.
      :param random_state: Seed for the random number generator.

      :returns: Result with ``.table`` and ``.plot()``.
      :rtype: IQMResult

      :raises ValueError: If no seed data is available (``_raw_runs is None``).
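
      A minimal standard-library sketch of the idea, for one model. It
      assumes the IQM is the mean of the middle 50% of the flat score array,
      that "stratified" means seeds are resampled with replacement within
      each dataset, and that the CI is a percentile interval. The helper
      names and data are hypothetical, and details may differ from evaluma
      and rliable.

      ```python
      import random
      from statistics import mean


      def iqm(values):
          """Interquartile mean: average of the middle 50% of the sorted values."""
          ordered = sorted(values)
          cut = int(0.25 * len(ordered))  # number of values dropped from each tail
          return mean(ordered[cut:len(ordered) - cut])


      def stratified_bootstrap_ci(runs_by_dataset, n_bootstrap=1000, random_state=None):
          """95% percentile interval; runs are resampled within each dataset."""
          rng = random.Random(random_state)
          estimates = []
          for _ in range(n_bootstrap):
              flat = []
              for runs in runs_by_dataset.values():
                  # Resample seeds with replacement, one dataset (stratum) at a time.
                  flat.extend(rng.choices(runs, k=len(runs)))
              estimates.append(iqm(flat))
          estimates.sort()
          lower = estimates[int(0.025 * (n_bootstrap - 1))]
          upper = estimates[int(0.975 * (n_bootstrap - 1))]
          return lower, upper


      # Hypothetical normalized scores for one model: 3 datasets x 4 seeds.
      runs_by_dataset = {
          "ds1": [0.62, 0.65, 0.60, 0.66],
          "ds2": [0.40, 0.45, 0.38, 0.43],
          "ds3": [0.81, 0.79, 0.84, 0.80],
      }
      flat_scores = [s for runs in runs_by_dataset.values() for s in runs]
      point = iqm(flat_scores)
      low, high = stratified_bootstrap_ci(runs_by_dataset, random_state=0)
      print(round(point, 4), round(low, 4), round(high, 4))
      ```

      With a single seed per dataset every resample is identical, so the
      interval collapses to a point. This is one way to see why the method
      requires multiple seeds.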


   .. py:method:: aggregate_ranking(agg='trimmed_mean')

      Compute a point-estimate descriptive ranking (no CI).

      Works on any benchmark regardless of whether seed data is present.

      .. note::
          This is a **descriptive point estimate only**. The trimmed-mean
          variant trims across datasets, not across seeds. With fewer than
          about 10 datasets the 25% trim is aggressive: with 5 datasets,
          only 3 contribute. Treat results as exploratory. For a
          statistically grounded ranking with uncertainty estimates, use
          ``iqm_ranking()`` (requires multiple seeds).

      :param agg: Aggregation mode: ``"trimmed_mean"`` (default), ``"mean"``,
                  or ``"median"``.

      :returns: Result with ``.table`` and ``.plot()``.
      :rtype: AggregateResult

      :raises ValueError: If ``agg`` is not a supported mode.
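
      The effect of the 25% trim on a small benchmark can be checked with a
      short sketch. It assumes the trim drops ``int(0.25 * n)`` scores from
      each end of one model's sorted per-dataset scores; ``trimmed_mean`` is
      a hypothetical helper, not evaluma's implementation.

      ```python
      from statistics import mean


      def trimmed_mean(scores, trim=0.25):
          """Drop the lowest and highest `trim` fraction of scores, then average."""
          ordered = sorted(scores)
          cut = int(trim * len(ordered))  # scores removed from each end
          kept = ordered[cut:len(ordered) - cut]
          return mean(kept), len(kept)


      # Hypothetical normalized scores for one model on 5 datasets.
      scores = [0.10, 0.55, 0.60, 0.65, 0.95]
      value, n_kept = trimmed_mean(scores)
      print(value, n_kept)
      ```

      With 5 datasets one score is removed from each end, so only 3 scores
      enter the average.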


   .. py:method:: bayesian_comparison(rope=0.01, reference=None, pairs=None, random_state=None)

      Compute pairwise Bayesian comparisons using a signed-rank test.

      :param rope: Region of practical equivalence half-width **in normalized
                   score space (0–1)**. Differences smaller than ``rope`` are
                   treated as practically equivalent.
      :param reference: If given, compare every other model against this one
                        only.
      :param pairs: Explicit list of ``(model_a, model_b)`` pairs to test.
                    Overrides ``reference``.
      :param random_state: Seed for baycomp's sampler.

      :returns: Result with ``.table`` and ``.plot()``.
      :rtype: BayesianResult
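
      To get a feel for the scale of ``rope``, the sketch below sorts
      per-dataset score differences into "worse", "equivalent", and "better"
      using the half-width. This is not the Bayesian signed-rank test; it
      only illustrates what a half-width of 0.01 means in normalized score
      space. The helper name and scores are hypothetical.

      ```python
      def rope_partition(scores_a, scores_b, rope=0.01):
          """Count datasets where model a is worse than, equivalent to, or better than b."""
          counts = {"a_worse": 0, "equivalent": 0, "a_better": 0}
          for a, b in zip(scores_a, scores_b):
              diff = a - b
              if abs(diff) < rope:
                  # Difference smaller than the half-width: practically equivalent.
                  counts["equivalent"] += 1
              elif diff > 0:
                  counts["a_better"] += 1
              else:
                  counts["a_worse"] += 1
          return counts


      # Hypothetical normalized scores of two models on 4 datasets.
      scores_a = [0.700, 0.655, 0.810, 0.420]
      scores_b = [0.695, 0.600, 0.850, 0.424]
      counts = rope_partition(scores_a, scores_b, rope=0.01)
      print(counts)
      ```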


   .. py:method:: frequentist_comparison(reference=None, alpha=0.05)

      Compute frequentist model comparisons.

      A Friedman omnibus test runs first, followed by one of two post-hoc
      procedures. All-pairs mode applies the Nemenyi test, following the
      Demšar (2006) / autorank workflow. Reference mode is an evaluma
      extension: pairwise Wilcoxon signed-rank tests against a named
      baseline, with Holm correction.

      :param reference: If given, compare every other model against this one only,
                        using Wilcoxon + Holm. ``None`` selects all-pairs Nemenyi mode.
      :param alpha: Significance level for the ``significant`` column (default 0.05).

      :returns: Result with ``.table`` and ``.plot()``.
      :rtype: FrequentistResult

      :raises ValueError: If fewer than 5 datasets are present.
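
      The Holm correction used in reference mode can be sketched in a few
      lines. The sketch assumes the standard step-down procedure applied to
      the raw Wilcoxon p-values and a strict ``p < alpha`` decision rule;
      the p-values, model names, and ``holm_adjust`` helper are hypothetical
      and not taken from evaluma.

      ```python
      def holm_adjust(p_values):
          """Holm step-down adjusted p-values, returned in the input order."""
          m = len(p_values)
          order = sorted(range(m), key=lambda i: p_values[i])
          adjusted = [0.0] * m
          running_max = 0.0
          for rank, idx in enumerate(order):
              # The smallest p-value is multiplied by m, the next by m - 1, and so on.
              candidate = min(1.0, (m - rank) * p_values[idx])
              # Enforce monotonicity so adjusted values never decrease.
              running_max = max(running_max, candidate)
              adjusted[idx] = running_max
          return adjusted


      # Hypothetical raw p-values for three models compared against one baseline.
      raw_p = {"model_b": 0.01, "model_c": 0.04, "model_d": 0.03}
      adjusted = dict(zip(raw_p, holm_adjust(list(raw_p.values()))))
      alpha = 0.05
      significant = {name: p < alpha for name, p in adjusted.items()}
      print(adjusted, significant)
      ```

      In this example only ``model_b`` stays below ``alpha`` after
      correction, although all three raw p-values are below 0.05.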

      .. rubric:: References

      Demšar, J. (2006). Statistical Comparisons of Classifiers over
      Multiple Data Sets. *JMLR*, 7, 1–30.


   .. py:method:: performance_profiles()

      Compute Dolan-Moré performance profiles.

      Profiles are computed on the raw (un-normalized) score matrix. All raw
      values must be strictly positive; otherwise a ``ValueError`` is raised.

      :returns: Result with ``.table`` and ``.plot()``.
      :rtype: ProfileResult

      :raises ValueError: If any raw score is zero or negative.
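
      A minimal sketch of a performance profile follows. Dolan and Moré
      define the ratio on costs (lower is better); the sketch assumes
      higher-is-better scores and therefore uses ``best_score / score``,
      which may not match evaluma's convention. The division also shows why
      zero or negative scores are problematic. The helper name and data are
      hypothetical.

      ```python
      def performance_profile(scores, taus):
          """scores: {model: [score per dataset]}, higher is better, all > 0."""
          models = list(scores)
          n_datasets = len(scores[models[0]])
          best = [max(scores[m][d] for m in models) for d in range(n_datasets)]
          profiles = {}
          for m in models:
              # Ratio is >= 1 and equals 1 where this model is best on the dataset.
              ratios = [best[d] / scores[m][d] for d in range(n_datasets)]
              # Fraction of datasets on which the model is within a factor tau of the best.
              profiles[m] = [sum(r <= tau for r in ratios) / n_datasets for tau in taus]
          return profiles


      # Hypothetical raw scores for two models on 3 datasets.
      raw = {"model_a": [0.80, 0.50, 0.90], "model_b": [0.40, 0.60, 0.90]}
      taus = [1.0, 1.5, 2.0]
      profiles = performance_profile(raw, taus)
      print(profiles)
      ```

      The value at ``tau = 1.0`` is the fraction of datasets on which a
      model is the best or tied for best; each curve is non-decreasing in
      ``tau``.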