evaluma.methods.frequentist
Functions
| _holm_correction(p_values) | Holm (1979) step-down correction. Verified against statsmodels in tests. |
| compute_frequentist(scores_matrix, *, reference=None, alpha=0.05) | Compute frequentist model comparisons. |
Module Contents
- evaluma.methods.frequentist._holm_correction(p_values: list[float]) → numpy.ndarray
Holm (1979) step-down correction. Verified against statsmodels in tests.
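As an illustration of the step-down procedure, here is a minimal sketch in pure Python. It is not evaluma's implementation: the name `holm_adjust` is hypothetical, and it returns a plain list where `_holm_correction` is documented to return a `numpy.ndarray`.

```python
def holm_adjust(p_values):
    """Return Holm step-down adjusted p-values in the original input order.

    Hypothetical helper for illustration; not part of evaluma.
    """
    m = len(p_values)
    # Indices of the p-values, sorted from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The rank-th smallest p-value is scaled by the number of
        # hypotheses not yet processed (m, m - 1, ..., 1).
        candidate = (m - rank) * p_values[idx]
        # Adjusted values must not decrease along the sorted order.
        running_max = max(running_max, candidate)
        # Adjusted p-values are capped at 1.
        adjusted[idx] = min(1.0, running_max)
    return adjusted


if __name__ == "__main__":
    print(holm_adjust([0.01, 0.04, 0.03]))
```

A hypothesis is rejected at level alpha when its adjusted p-value is at most alpha, which is equivalent to the sequential step-down rule.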
- evaluma.methods.frequentist.compute_frequentist(scores_matrix: pandas.DataFrame, *, reference=None, alpha=0.05) → evaluma.results.FrequentistResult
Compute frequentist model comparisons.
All-pairs mode follows the Demšar (2006) / autorank Friedman + Nemenyi workflow. Reference mode is an evaluma extension: pairwise Wilcoxon signed-rank tests against a named baseline with Holm correction.
A Friedman omnibus test always runs first. All-pairs mode follows it with the Nemenyi post-hoc test and computes the critical difference (CD) scalar. Reference mode follows it with Wilcoxon signed-rank tests against one reference model, corrected with Holm's procedure.
Note: For k = 2 models, Friedman + Nemenyi is used uniformly rather than the standalone Wilcoxon special case from Demšar (2006). This is slightly less powerful for small N (5–10 datasets) but avoids branching complexity and remains statistically valid.
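The average ranks and the critical difference used in all-pairs mode can be sketched as follows. This is a sketch under stated assumptions, not evaluma's code: it assumes higher scores are better, takes a plain dict instead of a DataFrame, uses the hypothetical names `average_ranks` and `critical_difference`, and covers only alpha = 0.05 with the critical values tabulated in Demšar (2006). The formula is CD = q_alpha * sqrt(k(k+1) / (6N)).

```python
import math

# Two-tailed Nemenyi critical values q_0.05 for k = 2..10 models,
# as tabulated in Demšar (2006).
Q_ALPHA_05 = {2: 1.960, 3: 2.343, 4: 2.569, 5: 2.728, 6: 2.850,
              7: 2.949, 8: 3.031, 9: 3.102, 10: 3.164}


def average_ranks(scores):
    """Mean rank per model; rank 1 is best and ties share the average rank.

    `scores` maps model name -> list of scores, one per dataset.
    Assumption: higher scores are better.
    """
    models = list(scores)
    n = len(next(iter(scores.values())))
    totals = {m: 0.0 for m in models}
    for j in range(n):
        column = [scores[m][j] for m in models]
        for m in models:
            value = scores[m][j]
            better = sum(1 for v in column if v > value)
            tied = sum(1 for v in column if v == value)  # includes itself
            # Average of the ranks better + 1 .. better + tied.
            totals[m] += better + (tied + 1) / 2
    return {m: totals[m] / n for m in models}


def critical_difference(k, n):
    """Nemenyi critical difference at alpha = 0.05 for k models, n datasets."""
    return Q_ALPHA_05[k] * math.sqrt(k * (k + 1) / (6 * n))
```

Two models differ significantly under the Nemenyi test when their average ranks differ by at least the CD.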
- Parameters:
scores_matrix – Normalized model × dataset score matrix (models as row index, datasets as columns).
reference – If provided, compare every other model against this one only. None triggers all-pairs mode.
alpha – Significance level (default 0.05).
- Returns:
FrequentistResult
- Raises:
ValueError – If there are fewer than 2 models (k < 2), fewer than 5 datasets (N < 5), or reference is not found in scores_matrix.
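The documented error conditions can be mirrored in a small check like the one below. It is only a sketch: `check_inputs` is a hypothetical name, the input is a plain dict rather than a DataFrame, and the error messages are invented for illustration.

```python
def check_inputs(scores, reference=None):
    """Validate a model -> per-dataset scores mapping; return (k, N).

    Hypothetical helper; mirrors the documented ValueError conditions only.
    """
    k = len(scores)
    if k < 2:
        raise ValueError(f"need at least 2 models, got {k}")
    # N is the number of datasets shared by every model.
    n = min(len(values) for values in scores.values())
    if n < 5:
        raise ValueError(f"need at least 5 datasets, got {n}")
    if reference is not None and reference not in scores:
        raise ValueError(f"reference {reference!r} not found among the models")
    return k, n
```

Checking k before N means an empty mapping is reported as a model-count problem rather than failing on the dataset count.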
References
Demšar, J. (2006). Statistical Comparisons of Classifiers over Multiple Data Sets. JMLR, 7, 1–30.