evaluma.results
===============

.. py:module:: evaluma.results


Classes
-------

.. autoapisummary::

   evaluma.results.AggregateResult
   evaluma.results.IQMResult
   evaluma.results.BayesianResult
   evaluma.results.FrequentistResult
   evaluma.results.ProfileResult


Module Contents
---------------

.. py:class:: AggregateResult(table: pandas.DataFrame)

   Result of :meth:`~evaluma.benchmark.Benchmark.aggregate_ranking`.


   .. py:attribute:: table


   .. py:method:: plot(figsize=None, model_colors=None, title=None, ax=None)

      Render a horizontal bar chart of aggregate scores.

      :param figsize: Figure size ``(width, height)`` in inches.
      :param model_colors: List of colors, one per model in table order.
      :param title: Optional axes title.
      :param ax: Existing axes to draw into; a new figure is created if
                 ``None``.

      :returns: The rendered figure.
      :rtype: matplotlib.figure.Figure


.. py:class:: IQMResult(table: pandas.DataFrame)

   Result of :meth:`~evaluma.benchmark.Benchmark.iqm_ranking`.


   .. py:attribute:: table


   .. py:method:: plot(figsize=None, model_colors=None, title=None, ax=None)

      Render a horizontal bar chart of IQM scores with CI error bars.

      :param figsize: Figure size ``(width, height)`` in inches.
      :param model_colors: List of colors, one per model in table order.
      :param title: Optional axes title.
      :param ax: Existing axes to draw into; a new figure is created if
                 ``None``.

      :returns: The rendered figure.
      :rtype: matplotlib.figure.Figure

      .. rubric:: Example

      >>> result = bench.iqm_ranking()
      >>> fig = result.plot(figsize=(8, 4))
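
For readers unfamiliar with the statistic plotted by ``IQMResult``, the following is a minimal sketch of an interquartile mean with a percentile-bootstrap confidence interval. The helper names ``iqm`` and ``bootstrap_ci``, the trimming rule (drop ``len // 4`` values from each tail), and the bootstrap settings are assumptions made for illustration; they are not taken from the ``evaluma`` source, which may compute the interval differently.

```python
import random
from statistics import mean


def iqm(scores):
    """Interquartile mean: average of the middle 50% of the sorted scores."""
    ordered = sorted(scores)
    k = len(ordered) // 4  # number of values trimmed from each tail
    return mean(ordered[k:len(ordered) - k])


def bootstrap_ci(scores, n_boot=2000, level=0.95, seed=0):
    """Percentile-bootstrap confidence interval for the IQM (illustrative)."""
    rng = random.Random(seed)
    n = len(scores)
    # Resample the scores with replacement and recompute the IQM each time.
    stats = sorted(
        iqm([scores[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_boot)
    )
    lo_idx = int((1 - level) / 2 * n_boot)
    hi_idx = int((1 + level) / 2 * n_boot) - 1
    return stats[lo_idx], stats[hi_idx]


# Hypothetical per-task scores for one model, with one outlier in each tail.
scores = [0.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 100.0]
point = iqm(scores)  # both outliers fall in the trimmed tails
low, high = bootstrap_ci(scores)
```

The interval ``(low, high)`` corresponds to the kind of error bar that the chart draws around each model's IQM score.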


.. py:class:: BayesianResult(table: pandas.DataFrame, reference=None)

   Result of :meth:`~evaluma.benchmark.Benchmark.bayesian_comparison`.


   .. py:attribute:: table


   .. py:attribute:: reference
      :value: None


   .. py:method:: plot(title=None)

      Render the comparison result.

      In all-pairs mode, this renders a pairwise heatmap. In reference mode,
      it renders a stacked horizontal bar chart sorted by
      P(model > reference).

      :param title: Optional figure title.

      :returns: The rendered figure.
      :rtype: matplotlib.figure.Figure

      .. rubric:: Example

      >>> result = bench.bayesian_comparison()
      >>> fig = result.plot()


.. py:class:: FrequentistResult(table: pandas.DataFrame, avg_ranks: pandas.Series, friedman_statistic: float, friedman_p_value: float, reference=None, alpha=0.05, cd=None)

   Result of :meth:`~evaluma.benchmark.Benchmark.frequentist_comparison`.


   .. py:attribute:: table


   .. py:attribute:: avg_ranks


   .. py:attribute:: friedman_statistic


   .. py:attribute:: friedman_p_value


   .. py:attribute:: reference
      :value: None


   .. py:attribute:: alpha
      :value: 0.05


   .. py:attribute:: cd
      :value: None


   .. py:method:: plot(title=None)

      Render the comparison result.

      In all-pairs mode, this renders a critical difference (CD) diagram
      (Demšar 2006) with the Nemenyi critical difference bracket. In
      reference mode, it renders a horizontal bar chart of Holm-corrected
      p-values.

      :param title: Optional figure title.

      :returns: The rendered figure.
      :rtype: matplotlib.figure.Figure

      .. rubric:: Example

      >>> result = bench.frequentist_comparison()
      >>> fig = result.plot()
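
The two quantities drawn by ``FrequentistResult.plot`` can be sketched in plain Python. This is an illustration under stated assumptions, not the ``evaluma`` implementation: ``nemenyi_cd`` and ``holm_adjust`` are hypothetical helper names, and the critical values are the two-tailed Nemenyi values for ``alpha = 0.05`` as tabulated in Demšar (2006).

```python
import math

# Two-tailed Nemenyi critical values q_alpha for alpha = 0.05, keyed by the
# number of models k, as tabulated in Demšar (2006).
Q_ALPHA_05 = {2: 1.960, 3: 2.343, 4: 2.569, 5: 2.728, 6: 2.850,
              7: 2.949, 8: 3.031, 9: 3.102, 10: 3.164}


def nemenyi_cd(n_models, n_tasks, q_table=Q_ALPHA_05):
    """Critical difference: CD = q_alpha * sqrt(k (k + 1) / (6 N))."""
    q = q_table[n_models]
    return q * math.sqrt(n_models * (n_models + 1) / (6.0 * n_tasks))


def holm_adjust(p_values):
    """Holm step-down adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The smallest p-value is multiplied by m, the next by m - 1, etc.
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Enforce monotonicity of the adjusted values.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


# Four models ranked on 14 tasks (hypothetical sizes).
cd = nemenyi_cd(n_models=4, n_tasks=14)

# Three hypothetical raw p-values from comparisons against a reference.
adjusted = holm_adjust([0.01, 0.04, 0.03])
```

In a CD diagram, two models whose average ranks differ by less than ``cd`` are not significantly different at the chosen ``alpha``. In reference mode, each adjusted p-value is compared against ``alpha`` directly.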


.. py:class:: ProfileResult(table: pandas.DataFrame)

   Result of :meth:`~evaluma.benchmark.Benchmark.performance_profiles`.


   .. py:attribute:: table


   .. py:property:: aup
      :type: pandas.Series


      Area Under the Profile in log₁₀(τ) space (left Riemann sum).

      For each model, integrates the step-function profile curve over
      log₁₀(τ) using consecutive tau breakpoints as the grid:

      .. math::

         \mathrm{AUP} = \sum_i \bigl(\log_{10}\tau_{i+1} - \log_{10}\tau_i\bigr)\,\rho(\tau_i)

      AUP is unnormalized: its scale depends on τ_max (the worst-case ratio
      in this run). It is meaningful for within-benchmark comparison but not
      across benchmarks with different τ_max values.

      :returns: Series indexed by model name, with values ≥ 0.

      .. rubric:: References

      - Roberts et al. (2022). AutoML Decathlon.
      - Dahl et al. (2023). AlgoPerf. arXiv:2306.07179.
      - Batra et al. (2025). ML-GYM. arXiv:2502.14499.


   .. py:method:: plot(figsize=None, model_colors=None, title=None, ax=None)

      Render Dolan-Moré performance profile curves on a log₁₀(τ) axis.

      :param figsize: Figure size ``(width, height)`` in inches.
      :param model_colors: Dict mapping model names to colors, or a list in
                           model order.
      :param title: Optional axes title.
      :param ax: Existing axes to draw into; a new figure is created if
                 ``None``.

      :returns: The rendered figure.
      :rtype: matplotlib.figure.Figure

      .. rubric:: Example

      >>> result = bench.performance_profiles()
      >>> fig = result.plot(figsize=(8, 5))
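
The left Riemann sum described for the ``aup`` property can be worked through on a small example. This is a minimal sketch, assuming the per-task performance ratios (each at least 1) are already available and that the grid consists of every distinct ratio observed in the run; the names ``profile``, ``aup`` and ``ratios_by_model`` are hypothetical and do not come from the ``evaluma`` source.

```python
import math


def profile(ratios, tau):
    """rho(tau): fraction of tasks whose performance ratio is <= tau."""
    return sum(r <= tau for r in ratios) / len(ratios)


def aup(ratios_by_model):
    """Left Riemann sum of each model's profile over log10(tau)."""
    # Shared grid: every distinct ratio observed in the run, ascending.
    taus = sorted({r for ratios in ratios_by_model.values() for r in ratios})
    result = {}
    for model, ratios in ratios_by_model.items():
        area = 0.0
        for left, right in zip(taus, taus[1:]):
            width = math.log10(right) - math.log10(left)
            # The profile is evaluated at the left edge of each interval.
            area += width * profile(ratios, left)
        result[model] = area
    return result


# Hypothetical ratios on three tasks; a ratio of 1.0 means "best on the task".
ratios_by_model = {
    "model_a": [1.0, 1.0, 10.0],
    "model_b": [10.0, 100.0, 1.0],
}
areas = aup(ratios_by_model)
```

Here the grid is τ ∈ {1, 10, 100}, so each interval has width 1 in log₁₀(τ) space. ``model_a`` has ρ(1) = 2/3 and ρ(10) = 1, giving an area of about 1.67; ``model_b`` has ρ(1) = 1/3 and ρ(10) = 2/3, giving 1.0. Extending τ_max would add further intervals to every model's sum, which is why the scale of AUP depends on the run.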