evaluma
=======

.. py:module:: evaluma


Submodules
----------

.. toctree::
   :maxdepth: 1

   /autoapi/evaluma/_version/index
   /autoapi/evaluma/benchmark/index
   /autoapi/evaluma/cli/index
   /autoapi/evaluma/methods/index
   /autoapi/evaluma/metric_registry/index
   /autoapi/evaluma/normalize/index
   /autoapi/evaluma/plot/index
   /autoapi/evaluma/results/index


Attributes
----------

.. autoapisummary::

   evaluma.__version__


Classes
-------

.. autoapisummary::

   evaluma.Benchmark


Functions
---------

.. autoapisummary::

   evaluma.load_df
   evaluma.load_csv
   evaluma._resolve_metric_type_bounds


Package Contents
----------------

.. py:data:: __version__
   :type: str

.. py:class:: Benchmark(raw_matrix: pandas.DataFrame, *, norm_ref_low=None, norm_ref_high=None, metric_direction=None, raw_runs=None)

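   Container for a normalized model-vs-dataset score matrix.

   After construction, the normalized scores are available as ``scores_``.
   Use ``iqm_ranking()``, ``aggregate_ranking()``, ``bayesian_comparison()``,
   and ``performance_profiles()`` to compute rankings, pairwise comparisons,
   and performance profiles.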


   .. py:attribute:: _raw


   .. py:attribute:: _norm_ref_low
      :value: None


   .. py:attribute:: _norm_ref_high
      :value: None


   .. py:attribute:: _metric_direction
      :value: None


   .. py:attribute:: _raw_runs
      :value: None


   .. py:method:: _normalize(matrix)


   .. py:property:: scores_

      Normalized model × dataset score matrix.


   .. py:method:: _new(raw_matrix, raw_runs=None)


   .. py:method:: select_models(models)

      Subset the benchmark to the given models.

      :param models: List of model names to retain.

      :returns: New benchmark containing only the selected models.
      :rtype: Benchmark


   .. py:method:: select_datasets(datasets)

      Subset the benchmark to the given datasets.

      :param datasets: List of dataset names to retain.

      :returns: New benchmark containing only the selected datasets.
      :rtype: Benchmark


   .. py:method:: drop_incomplete()

      Remove models that have missing scores for any dataset.

      :returns: New benchmark with incomplete models removed.
      :rtype: Benchmark
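
      A minimal sketch of the filtering rule, assuming the score matrix is held
      as a plain ``{model: {dataset: score}}`` dict with ``None`` marking a
      missing score. The helper name ``drop_incomplete_models`` is hypothetical
      and not part of ``evaluma``:

      .. code-block:: python

         def drop_incomplete_models(matrix):
             """Keep only models that have a score for every dataset."""
             # Collect every dataset that appears for at least one model.
             datasets = set()
             for row in matrix.values():
                 datasets.update(row)
             complete = {}
             for model, row in matrix.items():
                 # A score is missing if the key is absent or the value is None.
                 if all(row.get(d) is not None for d in datasets):
                     complete[model] = dict(row)
             return complete


         raw = {
             "model_a": {"d1": 0.8, "d2": 0.6},
             "model_b": {"d1": 0.7, "d2": None},
             "model_c": {"d1": 0.9},
         }
         kept = drop_incomplete_models(raw)

      As with the method above, the sketch returns a new mapping and leaves the
      input untouched.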


   .. py:method:: iqm_ranking(n_bootstrap=1000, random_state=None)

      Compute interquartile mean (IQM) rankings with stratified bootstrap
      confidence intervals.

      Implements the IQM of Agarwal et al. 2021 (rliable) on the flattened
      run × dataset score array. Requires multiple seeds; for single-run data,
      use ``aggregate_ranking()``.

      :param n_bootstrap: Number of bootstrap samples for the 95% confidence
                          interval.
      :param random_state: Seed for the random number generator.

      :returns: Result with ``.table`` and ``.plot()``.
      :rtype: IQMResult

      :raises ValueError: If no seed data is available (``_raw_runs is None``).
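
      The following standard-library sketch illustrates the idea for a single
      model. It assumes per-seed scores are stored as ``{dataset: [scores]}``
      and uses a simple percentile interval; the helper names ``iqm`` and
      ``bootstrap_iqm`` are hypothetical, and the library's actual
      implementation may differ in detail:

      .. code-block:: python

         import random


         def iqm(values):
             """Interquartile mean: the mean of the middle 50% of sorted values."""
             ordered = sorted(values)
             cut = int(0.25 * len(ordered))  # values trimmed from each tail
             middle = ordered[cut:len(ordered) - cut]
             return sum(middle) / len(middle)


         def bootstrap_iqm(runs_by_dataset, n_bootstrap=1000, random_state=None):
             """Return (point estimate, CI low, CI high) for one model."""
             rng = random.Random(random_state)
             flat = [s for runs in runs_by_dataset.values() for s in runs]
             point = iqm(flat)
             estimates = []
             for _ in range(n_bootstrap):
                 sample = []
                 for runs in runs_by_dataset.values():
                     # Stratified: resample runs with replacement per dataset.
                     sample.extend(rng.choices(runs, k=len(runs)))
                 estimates.append(iqm(sample))
             estimates.sort()
             # Percentile interval covering the central 95% of the estimates.
             low = estimates[int(0.025 * (n_bootstrap - 1))]
             high = estimates[int(0.975 * (n_bootstrap - 1))]
             return point, low, high


         runs = {"d1": [0.70, 0.72, 0.68, 0.71], "d2": [0.50, 0.55, 0.52, 0.49]}
         point, low, high = bootstrap_iqm(runs, n_bootstrap=200, random_state=0)

      Fixing ``random_state`` makes the resampling, and therefore the interval,
      reproducible.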


   .. py:method:: aggregate_ranking(agg='trimmed_mean')

      Compute a point-estimate descriptive ranking (no confidence interval).

      Works on any benchmark regardless of whether seed data is present.

      :param agg: Aggregation mode: ``"trimmed_mean"`` (default), ``"mean"``,
                  or ``"median"``.

      :returns: Result with ``.table`` and ``.plot()``.
      :rtype: AggregateResult

      :raises ValueError: If ``agg`` is not a supported mode.
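
      A hedged sketch of how the three modes can rank models differently. It
      assumes scores are held as ``{model: {dataset: score}}`` and that the
      trimmed mean cuts 25% from each tail; the trimming fraction used by the
      library is not stated here, and ``rank_by_aggregate`` is a hypothetical
      name:

      .. code-block:: python

         import statistics


         def trimmed_mean(values, proportion=0.25):
             """Mean after dropping `proportion` of the values from each tail."""
             ordered = sorted(values)
             cut = int(proportion * len(ordered))
             return statistics.fmean(ordered[cut:len(ordered) - cut])


         AGGREGATORS = {
             "trimmed_mean": trimmed_mean,
             "mean": statistics.fmean,
             "median": statistics.median,
         }


         def rank_by_aggregate(scores, agg="trimmed_mean"):
             """Return (model, aggregate) pairs sorted from best to worst."""
             if agg not in AGGREGATORS:
                 raise ValueError(f"unsupported aggregation mode: {agg!r}")
             fn = AGGREGATORS[agg]
             table = {m: fn(list(row.values())) for m, row in scores.items()}
             return sorted(table.items(), key=lambda item: item[1], reverse=True)


         scores = {
             "model_a": {"d1": 0.9, "d2": 0.8, "d3": 0.1, "d4": 0.7},
             "model_b": {"d1": 0.65, "d2": 0.65, "d3": 0.65, "d4": 0.65},
         }
         by_trimmed = rank_by_aggregate(scores)
         by_mean = rank_by_aggregate(scores, agg="mean")

      In this example the single low score on ``d3`` pulls ``model_a`` below
      ``model_b`` under the plain mean, while the trimmed mean discards it and
      ranks ``model_a`` first.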


   .. py:method:: bayesian_comparison(rope=0.01, reference=None, pairs=None, random_state=None)

      Compute pairwise Bayesian comparisons using the signed-rank test.

      :param rope: Half-width of the region of practical equivalence (ROPE).
      :param reference: If given, compare every other model against this
                        model only.
      :param pairs: Explicit list of ``(model_a, model_b)`` pairs to test.
                    Overrides ``reference``.
      :param random_state: Seed for baycomp's sampler.

      :returns: Result with ``.table`` and ``.plot()``.
      :rtype: BayesianResult
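
      The sketch below illustrates only how the ``pairs`` and ``reference``
      arguments could select the comparisons, and how a ROPE half-width splits
      a score difference into three regions. It does not perform the Bayesian
      test itself. The helper names, the all-pairs default, and the
      ``(other, reference)`` ordering are assumptions:

      .. code-block:: python

         from itertools import combinations


         def comparison_pairs(models, reference=None, pairs=None):
             """Choose which model pairs to compare."""
             if pairs is not None:
                 return list(pairs)  # explicit pairs override `reference`
             if reference is not None:
                 return [(m, reference) for m in models if m != reference]
             return list(combinations(models, 2))  # assumed default: all pairs


         def rope_region(diff, rope=0.01):
             """Classify a score difference (a minus b) relative to the ROPE."""
             if abs(diff) <= rope:
                 return "equivalent"
             return "a_better" if diff > 0 else "b_better"


         models = ["a", "b", "c"]
         all_pairs = comparison_pairs(models)
         vs_reference = comparison_pairs(models, reference="a")
         explicit = comparison_pairs(models, reference="a", pairs=[("b", "c")])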


   .. py:method:: performance_profiles()

      Compute Dolan-Moré performance profiles.

      :returns: Result with ``.table`` and ``.plot()``.
      :rtype: ProfileResult

      :raises ValueError: If any raw score is zero or negative.
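
      A minimal sketch of a performance profile, assuming a complete
      ``{model: {dataset: score}}`` matrix in which higher scores are better,
      so that each performance ratio is the best score on a dataset divided by
      the model's score. The name ``profile_curves`` and this orientation of
      the ratio are assumptions, not the library's confirmed behaviour:

      .. code-block:: python

         def profile_curves(scores, taus):
             """For each model, the fraction of datasets with ratio <= tau."""
             for row in scores.values():
                 for value in row.values():
                     if value <= 0:
                         raise ValueError("scores must be strictly positive")
             datasets = sorted({d for row in scores.values() for d in row})
             # Best (highest) score achieved by any model on each dataset.
             best = {d: max(row[d] for row in scores.values()) for d in datasets}
             curves = {}
             for model, row in scores.items():
                 # A ratio of 1 means the model is the best on that dataset.
                 ratios = [best[d] / row[d] for d in datasets]
                 curves[model] = [
                     sum(r <= tau for r in ratios) / len(ratios) for tau in taus
                 ]
             return curves


         scores = {
             "a": {"d1": 0.8, "d2": 0.5},
             "b": {"d1": 0.2, "d2": 1.0},
         }
         curves = profile_curves(scores, taus=[1.0, 1.5, 2.0])

      The ratio is undefined for a zero score and changes meaning for negative
      ones, which is consistent with the ``ValueError`` documented above.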


.. py:function:: load_df(df, *, model='model', dataset='dataset', metric='metric', score='score', seed=None, metric_type_bounds=None, norm_ref_low=None, norm_ref_high=None, metric_direction=None, drop_incomplete=False)

   Load a DataFrame and return a ready-to-use Benchmark object.

   :param df: A pandas DataFrame in long format (one row per model/dataset pair,
              or per model/dataset/seed combination when ``seed`` is given).
              To load from a CSV file, use :func:`evaluma.load_csv` instead.
   :param model: Column name for the model identifier.
   :param dataset: Column name for the dataset identifier.
   :param metric: Column name for the metric identifier.
   :param score: Column name for the score values.
   :param seed: Column name for the random seed. When provided, all seed rows
                are preserved and a ``seed`` column is included in the loaded
                DataFrame.
   :param metric_type_bounds: Dict mapping metric names to ``(low, high)``
                              bound tuples. ``high`` may be a model name string, resolved
                              per-dataset to that model's score. When provided, ``norm_ref_low``
                              and ``norm_ref_high`` must be ``None``. Metric direction is inferred
                              from the built-in registry; unknown metrics raise ``ValueError``.
                              Bounded metrics not listed here (e.g. accuracy, iou, f1) use their
                              natural ``[0, 1]`` bounds automatically. Unbounded metrics (rmse,
                              mae, mse) must be listed; omitting them raises ``ValueError``.
   :param norm_ref_low: Lower normalization reference: a scalar, a model name,
                        or a per-dataset dict. If ``None``, the per-dataset minimum is
                        used and a ``UserWarning`` is emitted. Cannot be combined with
                        ``metric_type_bounds``.
   :param norm_ref_high: Upper normalization reference, same format as
                         ``norm_ref_low``. If ``None``, the per-dataset maximum is used.
                         Cannot be combined with ``metric_type_bounds``.
   :param metric_direction: Dict mapping dataset names to ``"min"`` or
                            ``"max"``. When used with ``metric_type_bounds``, these entries
                            take precedence over the registry-inferred direction. Without
                            ``metric_type_bounds``, datasets mapped to ``"min"`` are negated
                            before normalization so that higher is always better.
   :param drop_incomplete: If ``True``, silently drop models with missing
                           scores instead of raising.

   :returns: Normalized benchmark ready for analysis.
   :rtype: Benchmark

   :raises TypeError: If ``df`` is not a pandas DataFrame.
   :raises ValueError: If ``metric_type_bounds`` is provided together with
       ``norm_ref_low`` or ``norm_ref_high``.
   :raises ValueError: If the data contains more than one metric per
       (model, dataset) pair, or if the score matrix is incomplete
       and ``drop_incomplete`` is ``False``.
   :raises ValueError: If a metric referenced by a dataset is not in the
       registry and not covered by ``metric_type_bounds``.
   :raises ValueError: If an unbounded metric (rmse, mae, mse) is present but
       no upper bound is specified for it in ``metric_type_bounds``.
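
   The normalization described for ``norm_ref_low``, ``norm_ref_high``, and
   ``metric_direction`` can be sketched for a single dataset as follows. This
   is an illustration under stated assumptions rather than the library's code:
   ``normalize_dataset`` is a hypothetical helper, scores are a plain
   ``{model: score}`` dict, and explicit references are assumed to be given on
   the same scale as the (possibly negated) scores:

   .. code-block:: python

      def normalize_dataset(scores, low=None, high=None, direction="max"):
          """Min-max normalize one dataset so that higher is always better."""
          if direction == "min":
              # Negate so that a lower raw value becomes a higher score.
              scores = {m: -s for m, s in scores.items()}
          # Fall back to the observed extremes when no reference is given.
          lo = min(scores.values()) if low is None else low
          hi = max(scores.values()) if high is None else high
          if hi == lo:
              raise ValueError("normalization references must differ")
          return {m: (s - lo) / (hi - lo) for m, s in scores.items()}


      accuracy = {"a": 0.6, "b": 0.8, "c": 0.7}
      rmse = {"a": 2.0, "b": 4.0, "c": 3.0}

      acc_norm = normalize_dataset(accuracy, low=0.5, high=1.0)
      rmse_norm = normalize_dataset(rmse, direction="min")

   With the per-dataset extremes as references, the best model on a dataset
   maps to 1 and the worst to 0, which is why relying on these defaults makes
   the result depend on which models are included.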


.. py:function:: load_csv(path, *, model='model', dataset='dataset', metric='metric', score='score', seed=None, metric_type_bounds=None, norm_ref_low=None, norm_ref_high=None, metric_direction=None, drop_incomplete=False)

   Load a benchmark CSV file and return a ready-to-use Benchmark object.

   :param path: Path to the CSV file.
   :param model: Column name for the model identifier.
   :param dataset: Column name for the dataset identifier.
   :param metric: Column name for the metric identifier.
   :param score: Column name for the score values.
   :param seed: Column name for the random seed.
   :param metric_type_bounds: See :func:`evaluma.load_df`.
   :param norm_ref_low: See :func:`evaluma.load_df`.
   :param norm_ref_high: See :func:`evaluma.load_df`.
   :param metric_direction: See :func:`evaluma.load_df`.
   :param drop_incomplete: See :func:`evaluma.load_df`.

   :returns: Normalized benchmark ready for analysis.
   :rtype: Benchmark
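
   To make the expected file layout concrete, the sketch below reads a
   long-format CSV with the default column names and pivots it into a
   model × dataset mapping using only the standard library. The helper
   ``read_long_csv`` is hypothetical, ignores seeds, and treats a repeated
   (model, dataset) pair as an error:

   .. code-block:: python

      import csv
      import io


      def read_long_csv(handle, model="model", dataset="dataset",
                        metric="metric", score="score"):
          """Pivot long-format rows into {model: {dataset: score}}."""
          matrix = {}          # model -> {dataset: score}
          dataset_metric = {}  # dataset -> metric name
          for row in csv.DictReader(handle):
              m, d = row[model], row[dataset]
              if d in matrix.setdefault(m, {}):
                  raise ValueError(f"duplicate row for ({m!r}, {d!r})")
              matrix[m][d] = float(row[score])
              dataset_metric[d] = row[metric]
          return matrix, dataset_metric


      text = "\n".join([
          "model,dataset,metric,score",
          "a,d1,accuracy,0.8",
          "a,d2,rmse,2.5",
          "b,d1,accuracy,0.7",
          "b,d2,rmse,3.0",
      ])
      matrix, dataset_metric = read_long_csv(io.StringIO(text))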


.. py:function:: _resolve_metric_type_bounds(metric_type_bounds, dataset_metric_map, raw_matrix, metric_direction_override)

   Resolve per-dataset normalization bounds and directions from the metric registry.

   For each dataset, consults ``metric_type_bounds`` first, then falls back to the
   built-in registry for metrics with natural bounds. Raises if an unbounded metric
   (rmse, mae, mse) has no entry in ``metric_type_bounds``.

   :param metric_type_bounds: Dict mapping metric names to ``(low, high)`` tuples.
   :param dataset_metric_map: Dict mapping dataset names to metric name strings.
   :param raw_matrix: Model × dataset score DataFrame (used to resolve model-name bounds).
   :param metric_direction_override: Optional dict mapping dataset names to
                                     ``"min"``/``"max"``; these entries override registry-inferred directions.

   :returns: ``(norm_ref_low, norm_ref_high, metric_direction)``, where the first
             two are ``pd.Series`` keyed by dataset and the last is a dict (or
             ``None`` if empty).
   :rtype: tuple
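
   The lookup order described above can be sketched with plain dicts. The
   ``REGISTRY`` contents, the helper name ``resolve_bounds``, and the use of
   dicts instead of ``pd.Series`` are all assumptions made for illustration:

   .. code-block:: python

      # Hypothetical stand-in for the built-in registry:
      # metric -> (direction, natural bounds or None if unbounded).
      REGISTRY = {
          "accuracy": ("max", (0.0, 1.0)),
          "f1": ("max", (0.0, 1.0)),
          "rmse": ("min", None),
      }


      def resolve_bounds(metric_type_bounds, dataset_metric_map, raw_matrix,
                         direction_override=None):
          """Return per-dataset (low, high, direction) dicts."""
          low, high, direction = {}, {}, {}
          for dataset, metric in dataset_metric_map.items():
              if metric not in REGISTRY:
                  raise ValueError(f"unknown metric {metric!r}")
              reg_direction, natural = REGISTRY[metric]
              # Explicit bounds win; otherwise use the natural bounds, if any.
              bounds = metric_type_bounds.get(metric, natural)
              if bounds is None:
                  raise ValueError(f"{metric!r} needs explicit bounds")
              lo, hi = bounds
              if isinstance(hi, str):
                  # A string names a model: use its score on this dataset.
                  hi = raw_matrix[hi][dataset]
              low[dataset], high[dataset] = lo, hi
              override = direction_override or {}
              direction[dataset] = override.get(dataset, reg_direction)
          return low, high, direction


      raw_matrix = {
          "baseline": {"d1": 0.5, "d2": 4.0},
          "model_a": {"d1": 0.8, "d2": 2.0},
      }
      low, high, direction = resolve_bounds(
          {"rmse": (0.0, "baseline")},
          {"d1": "accuracy", "d2": "rmse"},
          raw_matrix,
      )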