evaluma.methods.frequentist
===========================

.. py:module:: evaluma.methods.frequentist


Functions
---------

.. autoapisummary::

   evaluma.methods.frequentist._holm_correction
   evaluma.methods.frequentist.compute_frequentist


Module Contents
---------------

.. py:function:: _holm_correction(p_values: list[float]) -> numpy.ndarray

   Apply the Holm (1979) step-down correction for multiple comparisons to
   ``p_values``. The implementation is verified against statsmodels in the
   test suite.
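
   The step-down procedure can be sketched as follows. This is a minimal
   illustration that assumes the function returns adjusted p-values in the
   input order; the name ``holm_adjust`` is hypothetical and the code is not
   the evaluma implementation.

   .. code-block:: python

      import numpy as np


      def holm_adjust(p_values):
          """Return Holm-adjusted p-values in the original input order."""
          p = np.asarray(p_values, dtype=float)
          m = p.size
          order = np.argsort(p)  # indices that sort the p-values ascending
          # Step-down multipliers: m for the smallest p-value, then m-1, ..., 1.
          scaled = p[order] * (m - np.arange(m))
          # A running maximum keeps the adjusted values monotone; cap them at 1.
          adjusted_sorted = np.minimum(np.maximum.accumulate(scaled), 1.0)
          adjusted = np.empty(m)
          adjusted[order] = adjusted_sorted  # undo the sort
          return adjusted


      # Sorted p-values 0.01, 0.03, 0.04 are scaled by 3, 2, 1.
      print(holm_adjust([0.01, 0.04, 0.03]))

   A hypothesis is rejected at level ``alpha`` when its adjusted p-value is at
   most ``alpha``.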


.. py:function:: compute_frequentist(scores_matrix: pandas.DataFrame, *, reference=None, alpha=0.05) -> evaluma.results.FrequentistResult

   Compute frequentist model comparisons.

   All-pairs mode follows the Demšar (2006) / autorank Friedman + Nemenyi
   workflow. Reference mode is an evaluma extension: pairwise Wilcoxon
   signed-rank tests against a named baseline with Holm correction.
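
   The reference-mode idea can be sketched with ``scipy.stats.wilcoxon``. The
   scores, the model names, and the helper ``compare_to_reference`` below are
   hypothetical, and the sketch is not the evaluma implementation.

   .. code-block:: python

      import numpy as np
      from scipy.stats import wilcoxon

      # Hypothetical scores: one array per model, one entry per dataset.
      scores = {
          "baseline": np.array([0.70, 0.72, 0.68, 0.75, 0.71, 0.69]),
          "model_a": np.array([0.71, 0.74, 0.71, 0.79, 0.76, 0.75]),
          "model_b": np.array([0.71, 0.70, 0.71, 0.71, 0.76, 0.63]),
      }


      def compare_to_reference(scores, reference):
          """Test each model against `reference`; return (raw, Holm-adjusted) p-values."""
          names = [name for name in scores if name != reference]
          # Paired test across datasets: model scores vs. reference scores.
          raw = np.array(
              [wilcoxon(scores[name], scores[reference]).pvalue for name in names]
          )
          # Holm step-down adjustment over the family of comparisons.
          m = raw.size
          order = np.argsort(raw)
          stepped = np.maximum.accumulate(raw[order] * (m - np.arange(m)))
          adjusted = np.empty(m)
          adjusted[order] = np.minimum(stepped, 1.0)
          return {n: (float(r), float(a)) for n, r, a in zip(names, raw, adjusted)}


      results = compare_to_reference(scores, "baseline")
      for name, (raw_p, adj_p) in results.items():
          print(f"{name}: raw p={raw_p:.4f}, Holm-adjusted p={adj_p:.4f}")

   In this example ``model_a`` beats the baseline on every dataset, while
   ``model_b`` wins and loses in turn, so only ``model_a`` has a small raw
   p-value.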

   A Friedman omnibus test is always run first. All-pairs mode then applies the
   Nemenyi post-hoc test and computes the critical difference (CD). Reference
   mode then applies Wilcoxon signed-rank tests with Holm correction, comparing
   each other model against the single reference model.
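
   The all-pairs quantities can be sketched as below. The sketch assumes that
   higher scores are better and uses the critical values :math:`q_{0.05}`
   tabulated in Demšar (2006); the helper ``friedman_and_cd`` and the example
   scores are hypothetical. Note that ``scipy.stats.friedmanchisquare``
   requires at least three models, so this sketch does not cover k=2.

   .. code-block:: python

      import numpy as np
      from scipy.stats import friedmanchisquare, rankdata

      # q_0.05 for the two-tailed Nemenyi test, keyed by number of models k.
      Q_ALPHA_005 = {3: 2.343, 4: 2.569, 5: 2.728}


      def friedman_and_cd(scores):
          """scores: array of shape (k models, N datasets); higher is better."""
          scores = np.asarray(scores, dtype=float)
          k, n = scores.shape
          # Omnibus test: one sample per model, paired across datasets.
          statistic, p_value = friedmanchisquare(*scores)
          # Rank models within each dataset (rank 1 = best), then average.
          ranks = rankdata(-scores, axis=0)
          mean_ranks = ranks.mean(axis=1)
          # CD = q_alpha * sqrt(k (k + 1) / (6 N))
          cd = Q_ALPHA_005[k] * np.sqrt(k * (k + 1) / (6.0 * n))
          return statistic, p_value, mean_ranks, cd


      # Three models on five datasets; the first model always scores highest.
      example = [
          [0.90, 0.80, 0.85, 0.95, 0.70],
          [0.80, 0.70, 0.75, 0.85, 0.60],
          [0.70, 0.60, 0.65, 0.75, 0.50],
      ]
      statistic, p_value, mean_ranks, cd = friedman_and_cd(example)
      print(statistic, p_value, mean_ranks, cd)

   Under the Nemenyi test, two models differ significantly when their mean
   ranks differ by at least the CD.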

   Note: For k=2 models, Friedman + Nemenyi is used uniformly rather than the
   standalone Wilcoxon special case from Demšar (2006). This is slightly less
   powerful at small N (5–10 datasets) but avoids branching complexity and
   remains statistically valid.

   :param scores_matrix: Normalized model × dataset score matrix (models as row
                         index, datasets as columns).
   :param reference: Name of the reference model. If provided, every other model
                     is compared against this one only. ``None`` triggers
                     all-pairs mode.
   :param alpha: Significance level (default 0.05).

   :returns: FrequentistResult

   :raises ValueError: If there are fewer than two models (k < 2) or fewer than
                       five datasets (N < 5), or if ``reference`` is not found
                       in ``scores_matrix``.
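
   The expected input layout and the documented error conditions can be
   illustrated with a hypothetical pre-flight check. ``check_inputs`` and the
   example data are assumptions made for illustration, not part of evaluma.

   .. code-block:: python

      import pandas as pd


      def check_inputs(scores_matrix, reference=None):
          """Hypothetical check mirroring the documented ValueError cases."""
          k, n = scores_matrix.shape  # k models (rows), N datasets (columns)
          if k < 2:
              raise ValueError(f"need at least 2 models, got {k}")
          if n < 5:
              raise ValueError(f"need at least 5 datasets, got {n}")
          if reference is not None and reference not in scores_matrix.index:
              raise ValueError(f"reference {reference!r} not found in scores_matrix")


      # Models as the row index, datasets as columns.
      scores_matrix = pd.DataFrame(
          [
              [0.70, 0.72, 0.68, 0.75, 0.71],
              [0.71, 0.74, 0.71, 0.79, 0.76],
              [0.69, 0.70, 0.66, 0.74, 0.73],
          ],
          index=["baseline", "model_a", "model_b"],
          columns=["ds1", "ds2", "ds3", "ds4", "ds5"],
      )
      check_inputs(scores_matrix, reference="baseline")  # passes silently

      # With evaluma installed, the documented call would then be:
      # result = compute_frequentist(scores_matrix, reference="baseline", alpha=0.05)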

   .. rubric:: References

   Demšar, J. (2006). Statistical Comparisons of Classifiers over
   Multiple Data Sets. *JMLR*, 7, 1–30.


