Frequentist vs Bayesian Model Comparison

When you finish a benchmark run, two different questions are worth asking:

  1. “Is there a significant difference between models A and B?” The frequentist Friedman + Nemenyi procedure answers with a p-value. If the adjusted pairwise p-value falls below α, you reject the null hypothesis that the two models have the same average rank across datasets.

  2. “How probable is it that model A is better than B on a new dataset?” The Bayesian signed-rank test answers with a posterior probability. P(A > B) = 0.85 means that, given the observed data and the prior, there is an 85% probability that A outperforms B on a fresh dataset.

These are complementary, not competing, perspectives. This tutorial runs both on the same benchmark and shows when they agree, when they diverge, and which to use in practice.

import warnings
warnings.filterwarnings("ignore")

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import evaluma

Setup: a shared benchmark

rng = np.random.RandomState(42)
datasets = [f"D{i:02d}" for i in range(1, 11)]

# Model-A is consistently better; Model-B and Model-C are near-identical.
models_scores = {
    "Model-A": np.clip(rng.normal(0.80, 0.06, 10), 0, 1),
    "Model-B": np.clip(rng.normal(0.68, 0.06, 10), 0, 1),
    "Model-C": np.clip(rng.normal(0.66, 0.06, 10), 0, 1),
}

rows = [
    {"model": model, "dataset": d, "metric": "acc", "score": round(float(s), 4)}
    for model, scores in models_scores.items()
    for d, s in zip(datasets, scores)
]
bench = evaluma.load_df(
    pd.DataFrame(rows),
    model="model", dataset="dataset", metric="metric", score="score",
    norm_ref_low=0.0, norm_ref_high=1.0,
)

# Preview normalised scores
bench.scores_
            D01     D02     D03     D04     D05     D06     D07     D08     D09     D10
model
Model-A  0.8298  0.7917  0.8389  0.8914  0.7860  0.7860  0.8948  0.8460  0.7718  0.8326
Model-B  0.6522  0.6521  0.6945  0.5652  0.5765  0.6463  0.6192  0.6989  0.6255  0.5953
Model-C  0.7479  0.6465  0.6641  0.5745  0.6273  0.6667  0.5909  0.6825  0.6240  0.6425

Frequentist: Friedman + Nemenyi

freq_result = bench.frequentist_comparison(alpha=0.05)
print(f"Friedman p = {freq_result.friedman_p_value:.4f},  CD = {freq_result.cd:.3f}")
freq_result.table[["model_a", "model_b", "rank_diff", "p_value", "significant"]]
Friedman p = 0.0006,  CD = 1.048
   model_a  model_b  rank_diff   p_value  significant
0  Model-A  Model-B        1.5  0.002296         True
1  Model-A  Model-C        1.5  0.002296         True
2  Model-B  Model-C        0.0  1.000000        False
fig = freq_result.plot(title="Critical Difference Diagram")
plt.tight_layout()
plt.show()
[Figure: Critical Difference Diagram]
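
The CD value and the rank_diff column can be reproduced by hand. The sketch below is not evaluma's implementation. It assumes the standard Nemenyi critical difference from Demšar (2006), CD = q_α · sqrt(k(k+1) / (6N)), with the q_0.05 values taken from that paper's table, and it ranks the scores printed above with rank 1 as best. The helper names are illustrative.

```python
import math

# Scores copied from the bench.scores_ preview above (D01..D10).
scores = {
    "Model-A": [0.8298, 0.7917, 0.8389, 0.8914, 0.7860, 0.7860, 0.8948, 0.8460, 0.7718, 0.8326],
    "Model-B": [0.6522, 0.6521, 0.6945, 0.5652, 0.5765, 0.6463, 0.6192, 0.6989, 0.6255, 0.5953],
    "Model-C": [0.7479, 0.6465, 0.6641, 0.5745, 0.6273, 0.6667, 0.5909, 0.6825, 0.6240, 0.6425],
}

# Two-tailed Nemenyi critical values q_0.05 by number of models k (Demšar 2006).
Q_005 = {2: 1.960, 3: 2.343, 4: 2.569, 5: 2.728}


def average_ranks(scores):
    """Rank models within each dataset (1 = best, assumes no ties), then average."""
    models = list(scores)
    n = len(next(iter(scores.values())))
    totals = dict.fromkeys(models, 0)
    for i in range(n):
        ordered = sorted(models, key=lambda m: scores[m][i], reverse=True)
        for rank, m in enumerate(ordered, start=1):
            totals[m] += rank
    return {m: totals[m] / n for m in models}


def critical_difference(k, n, q=Q_005):
    """Nemenyi critical difference for k models compared over n datasets."""
    return q[k] * math.sqrt(k * (k + 1) / (6 * n))


ranks = average_ranks(scores)
cd = critical_difference(k=len(scores), n=10)
print(ranks)              # {'Model-A': 1.0, 'Model-B': 2.5, 'Model-C': 2.5}
print(f"CD = {cd:.3f}")   # CD = 1.048
```

Model-A's average rank of 1.0 sits 1.5 below the other two, which exceeds the CD of about 1.048, while Model-B and Model-C tie at 2.5 and so show a rank_diff of 0.0.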

Bayesian: posterior probability of superiority

bayes_result = bench.bayesian_comparison(rope=0.01, random_state=0)
bayes_result.table[["model_a", "model_b", "p_a_better", "p_equiv", "p_b_better"]]
   model_a  model_b  p_a_better  p_equiv  p_b_better
0  Model-A  Model-B     1.00000  0.00000      0.0000
1  Model-A  Model-C     1.00000  0.00000      0.0000
2  Model-B  Model-C     0.12232  0.19598      0.6817
fig = bayes_result.plot(title="Bayesian pairwise comparison")
plt.tight_layout()
plt.show()
[Figure: Bayesian pairwise comparison]
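
To make the three probabilities less of a black box, here is a minimal Monte Carlo sketch of a Bayesian signed-rank test in the spirit of Benavoli et al. (2017). It is an illustration under stated assumptions, not evaluma's code: the prior is a single pseudo-observation at zero with weight 0.5 (the `prior_strength` default is an assumption), each posterior draw is a Dirichlet weight vector, and each draw is credited to whichever region (B better, ROPE, A better) holds the most mass over pairwise sums of differences. The function name and its parameters are hypothetical.

```python
import random


def bayesian_signed_rank(diffs, rope=0.01, prior_strength=0.5, n_samples=2000, seed=0):
    """Return (p_a_better, p_equiv, p_b_better) for diffs = score_A - score_B."""
    rng = random.Random(seed)
    # A pseudo-observation at 0 carries the prior; real differences get weight 1.
    z = [0.0] + list(diffs)
    alpha = [prior_strength] + [1.0] * len(diffs)
    n = len(z)

    # Classify every pairwise sum z_i + z_j once: 0 = B better, 1 = ROPE, 2 = A better.
    region = [[1] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            s = z[i] + z[j]
            if s > 2 * rope:
                region[i][j] = 2
            elif s < -2 * rope:
                region[i][j] = 0

    wins = [0, 0, 0]
    for _ in range(n_samples):
        # Dirichlet(alpha) draw via normalised Gamma variates.
        g = [rng.gammavariate(a, 1.0) for a in alpha]
        total = sum(g)
        w = [x / total for x in g]
        mass = [0.0, 0.0, 0.0]
        for i in range(n):
            for j in range(n):
                mass[region[i][j]] += w[i] * w[j]
        # The region holding the most mass wins this posterior draw.
        wins[mass.index(max(mass))] += 1

    return wins[2] / n_samples, wins[1] / n_samples, wins[0] / n_samples


# Model-A and Model-B scores copied from the bench.scores_ preview.
model_a = [0.8298, 0.7917, 0.8389, 0.8914, 0.7860, 0.7860, 0.8948, 0.8460, 0.7718, 0.8326]
model_b = [0.6522, 0.6521, 0.6945, 0.5652, 0.5765, 0.6463, 0.6192, 0.6989, 0.6255, 0.5953]
diffs = [a - b for a, b in zip(model_a, model_b)]

p_a, p_eq, p_b = bayesian_signed_rank(diffs, rope=0.01)
print(f"p_a_better={p_a:.3f}  p_equiv={p_eq:.3f}  p_b_better={p_b:.3f}")
```

Because every A minus B difference is far above the ROPE, almost all posterior mass lands in the "A better" region, which is consistent with the p_a_better of 1.0 reported above. Exact figures depend on the sampler and the number of draws.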

Side-by-side: where they agree

merged = freq_result.table[["model_a", "model_b", "p_value", "significant"]].merge(
    bayes_result.table[["model_a", "model_b", "p_a_better", "p_equiv", "p_b_better"]],
    on=["model_a", "model_b"],
    how="left",
)
merged
   model_a  model_b   p_value  significant  p_a_better  p_equiv  p_b_better
0  Model-A  Model-B  0.002296         True     1.00000  0.00000      0.0000
1  Model-A  Model-C  0.002296         True     1.00000  0.00000      0.0000
2  Model-B  Model-C  1.000000        False     0.12232  0.19598      0.6817

For the A–B and A–C pairs (where Model-A clearly dominates), both methods agree: the difference is significant (Nemenyi p ≈ 0.002, below 0.05) and Model-A is almost certainly better (p_a_better = 1.0 to five decimal places).

For the B–C pair (near-identical models), the two methods describe the same data differently:

  • Frequentist: significant = False. Model-B and Model-C have identical average ranks (rank_diff = 0.0, p = 1.0), so the gap cannot exceed the critical difference.

  • Bayesian: p_b_better = 0.68, p_equiv = 0.20 and p_a_better = 0.12. The posterior leans toward Model-C being better, but it is far from decisive, and none of that information shows up in a “not significant” verdict.
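
One way to read the merged table row by row is to turn each pair into a short verdict. The helper below is a hypothetical convenience, not part of evaluma, and the 0.95 "decisive" posterior threshold is an assumption chosen for illustration. The rows are copied from the merged output above.

```python
def describe_pair(row, alpha=0.05, decisive=0.95):
    """Combine one row of frequentist and Bayesian output into a short verdict."""
    probs = {
        f"{row['model_a']} better": row["p_a_better"],
        "practically equivalent": row["p_equiv"],
        f"{row['model_b']} better": row["p_b_better"],
    }
    leader, p_max = max(probs.items(), key=lambda kv: kv[1])
    significant = row["p_value"] < alpha
    names_winner = leader != "practically equivalent"

    if significant and names_winner and p_max >= decisive:
        return f"agree: {leader} (p={row['p_value']:.4f}, posterior={p_max:.2f})"
    if not significant and p_max < decisive:
        return f"both inconclusive: posterior leans '{leader}' at {p_max:.2f}"
    return f"differ: significant={significant}, posterior says '{leader}' at {p_max:.2f}"


# Values copied from the merged table above.
rows = [
    {"model_a": "Model-A", "model_b": "Model-B", "p_value": 0.002296,
     "p_a_better": 1.00000, "p_equiv": 0.00000, "p_b_better": 0.0000},
    {"model_a": "Model-A", "model_b": "Model-C", "p_value": 0.002296,
     "p_a_better": 1.00000, "p_equiv": 0.00000, "p_b_better": 0.0000},
    {"model_a": "Model-B", "model_b": "Model-C", "p_value": 1.000000,
     "p_a_better": 0.12232, "p_equiv": 0.19598, "p_b_better": 0.6817},
]

verdicts = [describe_pair(r) for r in rows]
for r, v in zip(rows, verdicts):
    print(f"{r['model_a']} vs {r['model_b']}: {v}")
```

With these thresholds the first two pairs come out as agreement on Model-A, and the B–C pair is inconclusive under both methods, with the posterior leaning toward Model-C.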

When they diverge

Divergence typically happens in two situations:

1. Small N (few datasets)

With only 5–6 datasets, the Nemenyi test has limited power. frequentist_comparison requires at least 5 datasets; below that it raises a ValueError. The Bayesian test still returns a posterior at any N, although with very few datasets that posterior rests on little evidence and should be read with matching caution.

rows_small = [
    {"model": model, "dataset": d, "metric": "acc", "score": round(float(s), 4)}
    for model, scores in models_scores.items()
    for d, s in zip(datasets[:5], list(scores)[:5])
]
bench_small = evaluma.load_df(
    pd.DataFrame(rows_small),
    model="model", dataset="dataset", metric="metric", score="score",
    norm_ref_low=0.0, norm_ref_high=1.0,
)

freq_small = bench_small.frequentist_comparison(alpha=0.05)
bayes_small = bench_small.bayesian_comparison(rope=0.01, random_state=0)

print("Frequentist (N=5):")
print(freq_small.table[["model_a", "model_b", "p_value", "significant"]].to_string(index=False))
print()
print("Bayesian (N=5):")
print(bayes_small.table[["model_a", "model_b", "p_a_better", "p_equiv", "p_b_better"]].to_string(index=False))
Frequentist (N=5):
model_a model_b  p_value  significant
Model-A Model-B 0.030663         True
Model-A Model-C 0.068887        False
Model-B Model-C 0.946370        False

Bayesian (N=5):
model_a model_b  p_a_better  p_equiv  p_b_better
Model-A Model-B     0.99944  0.00056      0.0000
Model-A Model-C     0.99944  0.00056      0.0000
Model-B Model-C     0.12318  0.17362      0.7032

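With N=5 the Nemenyi test rejects only the A–B null (p ≈ 0.031). The A–C pair narrowly misses (p ≈ 0.069), even though Model-A scores higher than Model-C on every one of the five datasets. The Bayesian posteriors still reflect the structure of the data: p_a_better stays above 0.999 for both pairs.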
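
The split between A–B and A–C follows from how the critical difference scales with N. Ranking the first five columns of the score table gives average-rank gaps of 1.6 for A–B and 1.4 for A–C. The sketch below inverts the Demšar (2006) critical-difference formula to ask how many datasets a given rank gap needs before it can clear the CD. It assumes the gap stays the same as datasets are added, which real benchmarks will not guarantee, and the function name is illustrative.

```python
import math

# Two-tailed Nemenyi critical values q_0.05 by number of models k (Demšar 2006).
Q_005 = {2: 1.960, 3: 2.343, 4: 2.569, 5: 2.728}


def min_datasets_for_gap(rank_gap, k, q=Q_005):
    """Smallest N for which the Nemenyi critical difference is at most rank_gap.

    Solves q * sqrt(k * (k + 1) / (6 * N)) <= rank_gap for N.
    """
    return math.ceil(q[k] ** 2 * k * (k + 1) / (6 * rank_gap ** 2))


for gap in (1.6, 1.4, 1.0, 0.5):
    print(f"rank gap {gap:.1f} -> needs N >= {min_datasets_for_gap(gap, k=3)}")
# rank gap 1.6 -> needs N >= 5
# rank gap 1.4 -> needs N >= 6
# rank gap 1.0 -> needs N >= 11
# rank gap 0.5 -> needs N >= 44
```

A gap of 1.6 is just detectable at N=5, while a gap of 1.4 needs a sixth dataset, which matches the True/False split in the frequentist output above.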

2. Borderline cases near the ROPE

When two models differ by less than the ROPE (region of practical equivalence), the Bayesian test channels probability into p_equiv. A frequentist test, given enough datasets, can still reject the null for such a pair, because statistical significance says nothing about practical relevance.

# Models within 0.02 of each other
rng2 = np.random.RandomState(7)
rows_close = [
    {"model": "Close-A", "dataset": d, "metric": "acc", "score": round(float(s), 4)}
    for d, s in zip(datasets, np.clip(rng2.normal(0.70, 0.03, 10), 0, 1))
] + [
    {"model": "Close-B", "dataset": d, "metric": "acc", "score": round(float(s), 4)}
    for d, s in zip(datasets, np.clip(rng2.normal(0.69, 0.03, 10), 0, 1))
]
bench_close = evaluma.load_df(
    pd.DataFrame(rows_close),
    model="model", dataset="dataset", metric="metric", score="score",
    norm_ref_low=0.0, norm_ref_high=1.0,
)

freq_close = bench_close.frequentist_comparison(alpha=0.05)
bayes_close = bench_close.bayesian_comparison(rope=0.05, random_state=0)

print("Frequentist (Nemenyi):")
print(freq_close.table[["model_a", "model_b", "p_value", "significant"]].to_string(index=False))
print()
print("Bayesian (rope=0.05):")
print(bayes_close.table[["model_a", "model_b", "p_a_better", "p_equiv", "p_b_better"]].to_string(index=False))
Frequentist (Nemenyi):
model_a model_b  p_value  significant
Close-A Close-B 0.527089        False

Bayesian (rope=0.05):
model_a model_b  p_a_better  p_equiv  p_b_better
Close-A Close-B     0.01244  0.98756         0.0

Here p_equiv dominates at 0.988: under a ROPE of 0.05, the two models are practically equivalent. The Nemenyi test is also non-significant (p ≈ 0.53), but that result alone cannot separate “no real difference” from “too little power to detect one”. Note that evaluma uses the same Friedman + Nemenyi path even for k=2, rather than the standalone Wilcoxon special case from Demšar (2006), so the reported p-value comes from Nemenyi.
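
To build intuition for what the ROPE width does, the sketch below simply tallies per-dataset differences below, inside, and above the ROPE. This is a descriptive count only: the Bayesian test works on a posterior rather than raw counts, so its probabilities will not match these fractions. The differences are hypothetical values made up for illustration, not the Close-A and Close-B scores, and the function name is illustrative.

```python
def rope_tally(diffs, rope):
    """Fraction of per-dataset differences (A - B) below, inside, and above the ROPE."""
    n = len(diffs)
    below = sum(d < -rope for d in diffs)
    above = sum(d > rope for d in diffs)
    inside = n - below - above
    return {"b_better": below / n, "equivalent": inside / n, "a_better": above / n}


# Hypothetical differences, for illustration only.
diffs = [0.03, -0.01, 0.02, 0.04, -0.02, 0.01, 0.03, 0.00, 0.02, -0.03]

narrow = rope_tally(diffs, rope=0.01)
wide = rope_tally(diffs, rope=0.05)
print("rope=0.01:", narrow)
print("rope=0.05:", wide)
```

With the narrow ROPE half of the differences count in favour of A. With the wide ROPE every difference falls inside the equivalence band, which is the same mechanism that pushes p_equiv toward 1 in the run above.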

Practical guidance

Use the frequentist path when you need a p-value or CD diagram for a venue; use the Bayesian path when you want a probability statement (“P(A > B) = 0.85”). The frequentist path requires N ≥ 5 datasets, and results at that boundary should be treated cautiously because the Friedman chi-squared approximation is coarse at small N. The Bayesian test returns a posterior at any N, though at very small N it rests on correspondingly little evidence. When two models differ by less than your ROPE, the Bayesian test captures that practical equivalence explicitly through p_equiv; the frequentist test has no equivalent mechanism.
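
The caution about the chi-squared approximation can be checked directly at N=5, k=3, where the exact null distribution of the Friedman statistic is small enough to enumerate. The sketch below is a standalone illustration and does not claim to mirror how evaluma computes its Friedman p-value. The per-dataset ranks are derived from the first five columns of the score table, and the closed form exp(-x/2) is the chi-squared survival function for 2 degrees of freedom.

```python
import itertools
import math


def rank_total_sq_sum(rank_rows):
    """Sum of squared per-model rank totals; the Friedman statistic is linear in it."""
    k = len(rank_rows[0])
    return sum(sum(row[j] for row in rank_rows) ** 2 for j in range(k))


def friedman_stat(sq_sum, n, k):
    """Friedman chi-squared statistic computed from the squared rank totals."""
    return 12.0 / (n * k * (k + 1)) * sq_sum - 3 * n * (k + 1)


n, k = 5, 3
# Ranks of (Model-A, Model-B, Model-C) on D01..D05, 1 = best.
observed = [(1, 3, 2), (1, 2, 3), (1, 2, 3), (1, 3, 2), (1, 3, 2)]
obs_sq = rank_total_sq_sum(observed)
stat = friedman_stat(obs_sq, n, k)

# Chi-squared approximation with k - 1 = 2 degrees of freedom: sf(x) = exp(-x / 2).
p_approx = math.exp(-stat / 2)

# Exact null: each dataset's ranking is an independent, uniformly random
# permutation, so there are (k!)**n = 7776 equally likely outcomes.
perms = list(itertools.permutations(range(1, k + 1)))
hits = sum(
    rank_total_sq_sum(rows) >= obs_sq
    for rows in itertools.product(perms, repeat=n)
)
p_exact = hits / len(perms) ** n

print(f"chi2 = {stat:.2f}, approx p = {p_approx:.4f}, exact p = {p_exact:.4f}")
# chi2 = 7.60, approx p = 0.0224, exact p = 0.0239
```

The two p-values are close here, with the approximation slightly optimistic. The larger issue at N=5 is that the exact statistic can take only a small number of distinct values, so p-values move in coarse steps.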

Running both in a single workflow

# Full analysis pipeline
freq_res = bench.frequentist_comparison(alpha=0.05)
bayesian_res = bench.bayesian_comparison(rope=0.01, random_state=0)

fig = freq_res.plot(title="Frequentist: Critical Difference")
plt.tight_layout()
plt.show()

fig = bayesian_res.plot(title="Bayesian: Posterior Probabilities")
plt.tight_layout()
plt.show()
[Figure: Frequentist: Critical Difference]
[Figure: Bayesian: Posterior Probabilities]

References

  • Demšar, J. (2006). Statistical comparisons of classifiers over multiple data sets. JMLR, 7, 1–30.

  • Holm, S. (1979). A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics, 6(2), 65–70.

  • Benavoli, A., Corani, G., Demšar, J., & Zaffalon, M. (2017). Time for a change: a tutorial for comparing multiple classifiers through Bayesian analysis. JMLR, 18(77), 1–36.