evaluma.results

Classes#

`AggregateResult`	Result of `aggregate_ranking()`.
`IQMResult`	Result of `iqm_ranking()`.
`BayesianResult`	Result of `bayesian_comparison()`.
`FrequentistResult`	Result of `frequentist_comparison()`.
`ProfileResult`	Result of `performance_profiles()`.

Module Contents#

class evaluma.results.AggregateResult(table: pandas.DataFrame)#

Result of aggregate_ranking().

table#

plot(figsize=None, model_colors=None, title=None, ax=None)#

Render a horizontal bar chart of aggregate scores.

Parameters:

figsize – Figure size (width, height) in inches.
model_colors – List of colors, one per model in table order.
title – Optional axes title.
ax – Existing axes to draw into; a new figure is created if None.

Returns:

The rendered figure.

Return type:

matplotlib.figure.Figure

class evaluma.results.IQMResult(table: pandas.DataFrame)#

Result of iqm_ranking().

table#

plot(figsize=None, model_colors=None, title=None, ax=None)#

Render a horizontal bar chart of IQM scores with CI error bars.

Parameters:

figsize – Matplotlib figure size (width, height) in inches.
model_colors – List of colors, one per model in table order.
title – Optional axes title.
ax – Existing axes to draw into; a new figure is created if None.

Returns:

The rendered figure.

Return type:

matplotlib.figure.Figure

Example

>>> result = bench.iqm_ranking()
>>> fig = result.plot(figsize=(8, 4))
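
For readers unfamiliar with the statistic, the sketch below shows one common way to compute an interquartile mean (IQM) and a percentile-bootstrap confidence interval around it. It is an illustration only: the helper names `iqm` and `bootstrap_ci`, the truncating trim rule, and the bootstrap settings are assumptions, and this page does not say how `iqm_ranking()` computes its scores or intervals.

```python
import random
from statistics import mean


def iqm(scores):
    """Interquartile mean: average of the middle 50% of the sorted scores.

    Assumption: floor(n / 4) values are dropped from each end, which is one
    of several conventions when n is not divisible by four.
    """
    ordered = sorted(scores)
    k = len(ordered) // 4
    return mean(ordered[k:len(ordered) - k])


def bootstrap_ci(scores, n_resamples=2000, level=0.95, seed=0):
    """Percentile-bootstrap interval for the IQM (hypothetical settings)."""
    rng = random.Random(seed)
    n = len(scores)
    # Resample the scores with replacement and recompute the IQM each time.
    stats = sorted(
        iqm([scores[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_resamples)
    )
    lo_idx = int((1 - level) / 2 * n_resamples)
    hi_idx = int((1 + level) / 2 * n_resamples) - 1
    return stats[lo_idx], stats[hi_idx]


# Made-up per-task scores for a single model, including two outliers.
scores = [0.61, 0.64, 0.58, 0.70, 0.66, 0.63, 0.12, 0.95]
point = iqm(scores)            # the two lowest and two highest are dropped
low, high = bootstrap_ci(scores)
```

Because the extreme quarter on each side is discarded, the outliers 0.12 and 0.95 do not affect the point estimate here.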

class evaluma.results.BayesianResult(table: pandas.DataFrame, reference=None)#

Result of bayesian_comparison().

table#

reference = None#

plot(title=None)#

Render the comparison result.

In all-pairs mode, renders a pairwise heatmap. In reference mode, renders a stacked horizontal bar chart sorted by P(model > reference).

Parameters:

title – Optional figure title.

Returns:

The rendered figure.

Return type:

matplotlib.figure.Figure

Example

>>> result = bench.bayesian_comparison()
>>> fig = result.plot()

class evaluma.results.FrequentistResult(table: pandas.DataFrame, avg_ranks: pandas.Series, friedman_statistic: float, friedman_p_value: float, reference=None, alpha=0.05, cd=None)#

Result of frequentist_comparison().

table#

avg_ranks#

friedman_statistic#

friedman_p_value#

reference = None#

alpha = 0.05#

cd = None#

plot(title=None)#

Render the comparison result.

In all-pairs mode, renders a CD diagram (Demšar 2006) with the Nemenyi critical difference bracket. In reference mode, renders a horizontal bar chart of Holm-corrected p-values.

Parameters:

title – Optional figure title.

Returns:

The rendered figure.

Return type:

matplotlib.figure.Figure

Example

>>> result = bench.frequentist_comparison()
>>> fig = result.plot()
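
The two quantities named above can be sketched in plain Python. This is a minimal illustration of the textbook formulas, not evaluma's code: the function names `nemenyi_cd` and `holm_adjust` are hypothetical, and the critical values are the α = 0.05 two-tailed Nemenyi values as tabulated in Demšar (2006).

```python
import math

# Critical values q_alpha for the two-tailed Nemenyi test at alpha = 0.05,
# keyed by the number of models k (as tabulated in Demšar 2006).
Q_ALPHA_005 = {2: 1.960, 3: 2.343, 4: 2.569, 5: 2.728, 6: 2.850,
               7: 2.949, 8: 3.031, 9: 3.102, 10: 3.164}


def nemenyi_cd(n_models, n_tasks):
    """Critical difference in average rank: q_alpha * sqrt(k(k+1) / (6N))."""
    q = Q_ALPHA_005[n_models]
    return q * math.sqrt(n_models * (n_models + 1) / (6.0 * n_tasks))


def holm_adjust(p_values):
    """Holm step-down adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The smallest p-value is multiplied by m, the next by m - 1, ...
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Enforce monotonicity so adjusted values never decrease.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


cd = nemenyi_cd(n_models=4, n_tasks=14)       # roughly 1.25
adjusted = holm_adjust([0.01, 0.04, 0.03])    # roughly [0.03, 0.06, 0.06]
```

In a CD diagram, two models whose average ranks differ by less than `cd` are typically joined by a bar to mark them as not significantly different.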

class evaluma.results.ProfileResult(table: pandas.DataFrame)#

Result of performance_profiles().

table#

property aup: pandas.Series#

Area Under the Profile in log₁₀(τ) space (left Riemann sum).

For each model, integrates the step-function profile curve over log₁₀(τ), using consecutive τ breakpoints as the grid:

AUP = Σ (log₁₀(τ_{i+1}) − log₁₀(τ_i)) · ρ(τ_i)

AUP is unnormalized: its scale depends on τ_max (the worst-case ratio in this run). It is meaningful for within-benchmark comparison but not across benchmarks with different τ_max values.

References

Roberts et al. (2022). AutoML Decathlon.
Dahl et al. (2023). AlgoPerf. arXiv:2306.07179.
Batra et al. (2025). ML-GYM. arXiv:2502.14499.

Returns:

pandas.Series indexed by model name, with values ≥ 0.
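
The left Riemann sum above can be written out directly. The sketch below is a plain-Python illustration under stated assumptions: the function name `aup_left_riemann` and the sample breakpoints are made up, and the inputs are taken to be one model's τ breakpoints in ascending order with the matching ρ(τ) values.

```python
import math


def aup_left_riemann(taus, rho):
    """AUP = sum_i (log10(tau[i+1]) - log10(tau[i])) * rho[i]."""
    if len(taus) != len(rho):
        raise ValueError("taus and rho must have the same length")
    total = 0.0
    for i in range(len(taus) - 1):
        width = math.log10(taus[i + 1]) - math.log10(taus[i])
        # Left Riemann sum: the height is taken at the left breakpoint.
        total += width * rho[i]
    return total


# Hypothetical breakpoints and profile values for one model.
taus = [1.0, 10.0, 100.0]
rho = [0.5, 0.75, 1.0]
aup = aup_left_riemann(taus, rho)

# The same curve on a grid that extends to a larger tau_max.
aup_wider = aup_left_riemann([1.0, 10.0, 100.0, 1000.0],
                             [0.5, 0.75, 1.0, 1.0])
```

The second call yields a larger value for the same curve shape, which is the τ_max dependence described above.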

plot(figsize=None, model_colors=None, title=None, ax=None)#

Render Dolan-Moré performance profile curves on a log₁₀(τ) axis.

Parameters:

figsize – Figure size (width, height) in inches.
model_colors – Dict mapping model names to colors, or a list in model order.
title – Optional axes title.
ax – Existing axes to draw into; a new figure is created if None.

Returns:

The rendered figure.

Return type:

matplotlib.figure.Figure

Example

>>> result = bench.performance_profiles()
>>> fig = result.plot(figsize=(8, 5))
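
This page does not define the curves themselves, so the following sketch uses the standard Dolan-Moré construction: each model's cost on a task is divided by the best cost on that task, and ρ(τ) is the fraction of tasks whose ratio is at most τ. The function name `performance_profile`, the "lower is better" cost convention, and the sample data are assumptions; how `performance_profiles()` maps benchmark scores to ratios is not stated here.

```python
def performance_profile(costs):
    """Return (taus, rho) for a dict of model -> per-task positive costs.

    Assumption: lower cost is better and every model has one cost per task.
    """
    models = list(costs)
    n_tasks = len(costs[models[0]])
    # Best (smallest) cost achieved by any model on each task.
    best = [min(costs[m][t] for m in models) for t in range(n_tasks)]
    # Performance ratio of each model on each task; always >= 1.
    ratios = {m: [costs[m][t] / best[t] for t in range(n_tasks)]
              for m in models}
    # Every observed ratio is a breakpoint of the step functions.
    taus = sorted({r for rs in ratios.values() for r in rs})
    rho = {m: [sum(r <= tau for r in ratios[m]) / n_tasks for tau in taus]
           for m in models}
    return taus, rho


# Made-up costs for two models on three tasks.
costs = {"model_a": [1.0, 2.0, 4.0], "model_b": [2.0, 2.0, 1.0]}
taus, rho = performance_profile(costs)
```

Each curve starts at the fraction of tasks on which the model is best (τ = 1) and reaches 1 once τ covers that model's worst ratio.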