evaluma.methods.frequentist

Functions

`_holm_correction`(→ numpy.ndarray)	Holm (1979) step-down correction. Verified against statsmodels in tests.
`compute_frequentist`(→ evaluma.results.FrequentistResult)	Compute frequentist model comparisons.

Module Contents

evaluma.methods.frequentist._holm_correction(p_values: list[float]) → numpy.ndarray

Holm (1979) step-down correction. Verified against statsmodels in tests.
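For readers unfamiliar with the procedure, Holm's step-down adjustment can be sketched in a few lines of plain Python. This is an illustration only, not evaluma's implementation: the name `holm_adjust` is hypothetical, it returns a list rather than a `numpy.ndarray`, and it assumes the output of interest is the vector of adjusted p-values in the original input order.

```python
def holm_adjust(p_values):
    """Holm (1979) step-down adjusted p-values, in the input order.

    Hypothetical helper for illustration; not part of evaluma.
    """
    m = len(p_values)
    # Indices of the p-values, smallest first.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The smallest p-value is scaled by m, the next by m - 1, and so on.
        scaled = (m - rank) * p_values[idx]
        # Adjusted values must not decrease along the sorted order.
        running_max = max(running_max, scaled)
        adjusted[idx] = min(1.0, running_max)
    return adjusted


adjusted = holm_adjust([0.01, 0.04, 0.03])  # approximately [0.03, 0.06, 0.06]
rejected = [p <= 0.05 for p in adjusted]    # [True, False, False]
```

A hypothesis is rejected at level alpha when its adjusted p-value is at most alpha, so in this example only the first comparison stays significant after correction.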

evaluma.methods.frequentist.compute_frequentist(scores_matrix: pandas.DataFrame, *, reference=None, alpha=0.05) → evaluma.results.FrequentistResult

Compute frequentist model comparisons.

A Friedman omnibus test always runs first; the post-hoc step depends on the mode.

All-pairs mode (reference is None) follows the Demšar (2006) / autorank workflow: a Nemenyi post-hoc test, plus the critical difference (CD) scalar.

Reference mode is an evaluma extension: pairwise Wilcoxon signed-rank tests of every other model against the named reference, with Holm correction.
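The quantities behind the all-pairs workflow can be sketched in plain Python using the formulas from Demšar (2006). This is a minimal sketch, not evaluma's code: the function names and the dict-of-lists input are hypothetical, higher scores are assumed to be better, and the critical values are the alpha = 0.05 entries from Demšar's Table 5(a), so other significance levels are not covered.

```python
import math

# Two-tailed Nemenyi critical values q_alpha for alpha = 0.05, keyed by the
# number of models k, as tabulated in Demšar (2006), Table 5(a).
Q_ALPHA_005 = {2: 1.960, 3: 2.343, 4: 2.569, 5: 2.728, 6: 2.850,
               7: 2.949, 8: 3.031, 9: 3.102, 10: 3.164}


def average_ranks(scores):
    """Mean rank of each model across datasets (rank 1 = best score).

    `scores` maps model name -> list of per-dataset scores. Tied scores
    share the mean of the ranks they span.
    """
    models = list(scores)
    n_datasets = len(scores[models[0]])
    totals = dict.fromkeys(models, 0.0)
    for d in range(n_datasets):
        ordered = sorted(models, key=lambda m: scores[m][d], reverse=True)
        i = 0
        while i < len(ordered):
            j = i
            # Extend j over the run of models tied with position i.
            while (j + 1 < len(ordered)
                   and scores[ordered[j + 1]][d] == scores[ordered[i]][d]):
                j += 1
            shared_rank = (i + j) / 2 + 1  # mean of 1-based ranks i+1..j+1
            for m in ordered[i:j + 1]:
                totals[m] += shared_rank
            i = j + 1
    return {m: totals[m] / n_datasets for m in models}


def friedman_statistic(avg_ranks, n_datasets):
    """Friedman chi-square statistic computed from average ranks."""
    k = len(avg_ranks)
    sum_sq = sum(r * r for r in avg_ranks.values())
    return 12.0 * n_datasets / (k * (k + 1)) * (sum_sq - k * (k + 1) ** 2 / 4.0)


def critical_difference(k, n_datasets):
    """Nemenyi critical difference: q_alpha * sqrt(k (k + 1) / (6 N))."""
    return Q_ALPHA_005[k] * math.sqrt(k * (k + 1) / (6.0 * n_datasets))


scores = {
    "a": [0.9, 0.8, 0.7, 0.9, 0.8],
    "b": [0.8, 0.7, 0.6, 0.8, 0.7],
    "c": [0.7, 0.6, 0.5, 0.7, 0.6],
}
avg = average_ranks(scores)            # {"a": 1.0, "b": 2.0, "c": 3.0}
chi2 = friedman_statistic(avg, 5)
cd = critical_difference(len(avg), 5)
# Two models differ significantly when their average ranks differ by more than cd.
a_vs_c_significant = abs(avg["a"] - avg["c"]) > cd
```

In this toy example the CD is about 1.48, so "a" and "c" (rank difference 2) are separated while "a" and "b" (rank difference 1) are not.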

Note: With two models (k = 2), Friedman + Nemenyi is still used rather than the standalone Wilcoxon special case from Demšar (2006). This is slightly less powerful when the number of datasets N is small (5–10), but it is statistically valid and avoids extra branching.

Parameters:

scores_matrix – Normalized model × dataset score matrix (models as row index, datasets as columns).
reference – Name of the baseline model. If provided, every other model is compared only against it (reference mode); None selects all-pairs mode.
alpha – Significance level (default 0.05).

Returns:

FrequentistResult

Raises:

ValueError – If there are fewer than 2 models (k < 2) or fewer than 5 datasets (N < 5), or if reference is not found in scores_matrix.
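To make the expected input shape concrete, the sketch below builds a small scores matrix in the documented orientation (models as the row index, datasets as columns) and mirrors the documented error conditions in a stand-alone pre-check. The helper `check_scores_matrix`, the model and dataset names, and the error messages are all hypothetical; evaluma's own validation and wording may differ.

```python
import pandas as pd


def check_scores_matrix(scores_matrix, reference=None):
    """Hypothetical pre-check mirroring the documented ValueError conditions."""
    k, n = scores_matrix.shape  # rows = models (k), columns = datasets (N)
    if k < 2:
        raise ValueError(f"need at least 2 models, got {k}")
    if n < 5:
        raise ValueError(f"need at least 5 datasets, got {n}")
    if reference is not None and reference not in scores_matrix.index:
        raise ValueError(f"reference {reference!r} is not a row of scores_matrix")


# Three models evaluated on five datasets (illustrative numbers only).
scores = pd.DataFrame(
    [
        [0.91, 0.84, 0.77, 0.88, 0.80],
        [0.89, 0.80, 0.75, 0.85, 0.79],
        [0.70, 0.65, 0.60, 0.72, 0.61],
    ],
    index=["model_a", "model_b", "model_c"],
    columns=["d1", "d2", "d3", "d4", "d5"],
)

check_scores_matrix(scores)                       # all-pairs mode input
check_scores_matrix(scores, reference="model_a")  # reference mode input
```

A matrix that passes this check can then be handed to `compute_frequentist(scores, reference="model_a", alpha=0.05)`, using the keyword-only arguments from the signature above.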

References

Demšar, J. (2006). Statistical Comparisons of Classifiers over Multiple Data Sets. JMLR, 7, 1–30.