evaluma.results#
Classes#
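AggregateResult | Result of aggregate_ranking().
IQMResult | Result of iqm_ranking().
BayesianResult | Result of bayesian_comparison().
FrequentistResult | Result of frequentist_comparison().
ProfileResult | Result of performance_profiles().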
Module Contents#
- class evaluma.results.AggregateResult(table: pandas.DataFrame)#
Result of aggregate_ranking().
- table#
- plot(figsize=None, model_colors=None, title=None, ax=None)#
Render a horizontal bar chart of aggregate scores.
- Parameters:
figsize – Figure size (width, height) in inches.
model_colors – List of colors, one per model in table order.
title – Optional axes title.
ax – Existing axes to draw into; a new figure is created if None.
- Returns:
The rendered figure.
- Return type:
matplotlib.figure.Figure
- class evaluma.results.IQMResult(table: pandas.DataFrame)#
Result of iqm_ranking().
- table#
- plot(figsize=None, model_colors=None, title=None, ax=None)#
Render a horizontal bar chart of IQM scores with CI error bars.
- Parameters:
figsize – Matplotlib figure size (width, height) in inches.
model_colors – List of colors, one per model in table order.
title – Optional axes title.
ax – Existing axes to draw into; a new figure is created if None.
- Returns:
The rendered figure.
- Return type:
matplotlib.figure.Figure
Example
>>> result = bench.iqm_ranking()
>>> fig = result.plot(figsize=(8, 4))
- class evaluma.results.BayesianResult(table: pandas.DataFrame, reference=None)#
Result of bayesian_comparison().
- table#
- reference = None#
- plot(title=None)#
Render the comparison result.
In all-pairs mode renders a pairwise heatmap. In reference mode renders a stacked horizontal bar chart sorted by P(model > reference).
- Parameters:
title – Optional figure title.
- Returns:
The rendered figure.
- Return type:
matplotlib.figure.Figure
Example
>>> result = bench.bayesian_comparison()
>>> fig = result.plot()
- class evaluma.results.FrequentistResult(table: pandas.DataFrame, avg_ranks: pandas.Series, friedman_statistic: float, friedman_p_value: float, reference=None, alpha=0.05, cd=None)#
Result of frequentist_comparison().
- table#
- avg_ranks#
- friedman_statistic#
- friedman_p_value#
- reference = None#
- alpha = 0.05#
- cd = None#
- plot(title=None)#
Render the comparison result.
In all-pairs mode renders a CD diagram (Demšar 2006) with the Nemenyi critical difference bracket. In reference mode renders a horizontal bar chart of Holm-corrected p-values.
- Parameters:
title – Optional figure title.
- Returns:
The rendered figure.
- Return type:
matplotlib.figure.Figure
Example
>>> result = bench.frequentist_comparison()
>>> fig = result.plot()
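The reference-mode plot shows Holm-corrected p-values. As a rough illustration of what that correction does, here is a minimal sketch of the standard Holm step-down adjustment in plain Python. The function name `holm_adjust` and the sample p-values are hypothetical; how evaluma computes the correction internally is not stated on this page.

```python
def holm_adjust(p_values):
    """Return Holm step-down adjusted p-values, in the input order.

    Hypothetical helper for illustration only; not part of evaluma.
    """
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The smallest p-value is multiplied by m, the next by m - 1, and so on,
        # with each result capped at 1.
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Enforce monotonicity: an adjusted value never drops below an earlier one.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


# Made-up raw p-values for three models compared against one reference.
raw = [0.01, 0.04, 0.03]
adj = holm_adjust(raw)
```

A model would then be flagged as significantly different from the reference when its adjusted p-value falls below `alpha`.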
- class evaluma.results.ProfileResult(table: pandas.DataFrame)#
Result of performance_profiles().
- table#
- property aup: pandas.Series#
Area Under the Profile in log₁₀(τ) space (left Riemann sum).
For each model, integrates the step-function profile curve over log₁₀(τ) using consecutive tau breakpoints as the grid:
AUP = Σ (log₁₀(τ_{i+1}) − log₁₀(τ_i)) · ρ(τ_i)
AUP is unnormalized: its scale depends on τ_max (the worst-case ratio in this run). It is meaningful for within-benchmark comparison but not across benchmarks with different τ_max values.
References
Roberts et al. (2022). AutoML Decathlon.
Dahl et al. (2023). AlgoPerf. arXiv:2306.07179.
Batra et al. (2025). ML-GYM. arXiv:2502.14499.
- Returns:
pd.Series indexed by model name, values ≥ 0.
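The left Riemann sum above can be sketched in a few lines of plain Python. This is an illustration of the formula only, not evaluma's implementation: the function name `area_under_profile`, the argument layout (one list of τ breakpoints and one list of ρ values for a single model), and the sample numbers are all assumptions.

```python
import math


def area_under_profile(taus, rho):
    """Left Riemann sum of a step-function profile over log10(tau).

    taus: increasing tau breakpoints (all > 0).
    rho:  profile value rho(tau_i) at each breakpoint.
    Hypothetical helper for illustration only; not part of evaluma.
    """
    if len(taus) != len(rho):
        raise ValueError("taus and rho must have the same length")
    total = 0.0
    for i in range(len(taus) - 1):
        # Interval width in log10(tau) space.
        width = math.log10(taus[i + 1]) - math.log10(taus[i])
        # Left rule: use the profile value at the left breakpoint.
        total += width * rho[i]
    return total


# Made-up profile for one model: breakpoints at tau = 1, 10, 100.
taus = [1.0, 10.0, 100.0]
rho = [0.5, 0.75, 1.0]
aup = area_under_profile(taus, rho)  # 1 * 0.5 + 1 * 0.75 = 1.25
```

Because the last breakpoint only closes the final interval, ρ at τ_max does not contribute to the sum, and a larger τ_max stretches the integration range. This is why the value is not comparable across benchmarks.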
- plot(figsize=None, model_colors=None, title=None, ax=None)#
Render Dolan-Moré performance profile curves on a log₁₀(τ) axis.
- Parameters:
figsize – Figure size (width, height) in inches.
model_colors – Dict mapping model names to colors, or a list in model order.
title – Optional axes title.
ax – Existing axes to draw into; a new figure is created if None.
- Returns:
The rendered figure.
- Return type:
matplotlib.figure.Figure
Example
>>> result = bench.performance_profiles()
>>> fig = result.plot(figsize=(8, 5))