evaluma.results

Classes

Module Contents

class evaluma.results.AggregateResult(table: pandas.DataFrame)

Result of aggregate_ranking().

table

plot(figsize=None, model_colors=None, title=None, ax=None)

Render a horizontal bar chart of aggregate scores.

Parameters:
  • figsize – Figure size (width, height) in inches.

  • model_colors – List of colors, one per model in table order.

  • title – Optional axes title.

  • ax – Existing axes to draw into; a new figure is created if None.

Returns:

The rendered figure.

Return type:

matplotlib.figure.Figure

class evaluma.results.IQMResult(table: pandas.DataFrame)

Result of iqm_ranking().

table

plot(figsize=None, model_colors=None, title=None, ax=None)

Render a horizontal bar chart of IQM scores with CI error bars.

Parameters:
  • figsize – Matplotlib figure size (width, height) in inches.

  • model_colors – List of colors, one per model in table order.

  • title – Optional axes title.

  • ax – Existing axes to draw into; a new figure is created if None.

Returns:

The rendered figure.

Return type:

matplotlib.figure.Figure

Example

>>> result = bench.iqm_ranking()
>>> fig = result.plot(figsize=(8, 4))
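For readers unfamiliar with the statistic, the interquartile mean (IQM) is the mean of the middle 50% of the scores. The sketch below is not evaluma's implementation: `interquartile_mean` and `bootstrap_ci` are hypothetical helpers, the trimming follows the whole-observation convention of `scipy.stats.trim_mean` with `proportiontocut=0.25`, and the percentile bootstrap is only one plausible way to obtain the CI shown as error bars.

```python
import random


def interquartile_mean(scores):
    """Mean of the middle 50% of scores (hypothetical helper, not evaluma API)."""
    ordered = sorted(scores)
    n = len(ordered)
    if n == 0:
        raise ValueError("scores must not be empty")
    cut = n // 4  # whole observations dropped from each tail
    middle = ordered[cut:n - cut]
    return sum(middle) / len(middle)


def bootstrap_ci(scores, stat, n_boot=2000, level=0.95, seed=0):
    """Percentile bootstrap interval for stat(scores) (assumed CI method)."""
    rng = random.Random(seed)
    n = len(scores)
    # Resample with replacement and recompute the statistic each time.
    stats = sorted(
        stat([rng.choice(scores) for _ in range(n)]) for _ in range(n_boot)
    )
    lo_idx = int((1 - level) / 2 * n_boot)
    hi_idx = int((1 + level) / 2 * n_boot) - 1
    return stats[lo_idx], stats[hi_idx]


scores = [0.1, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 5.0]
iqm = interquartile_mean(scores)  # averages 0.6, 0.7, 0.8, 0.9
low, high = bootstrap_ci(scores, interquartile_mean)
print(round(iqm, 3), low, high)
```

The outlier 5.0 and the low score 0.1 are both discarded before averaging, which is why IQM is less sensitive to extreme runs than a plain mean.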

class evaluma.results.BayesianResult(table: pandas.DataFrame, reference=None)

Result of bayesian_comparison().

table
reference = None

plot(title=None)

Render the comparison result.

In all-pairs mode, renders a pairwise heatmap. In reference mode, renders a stacked horizontal bar chart sorted by P(model > reference).

Parameters:

title – Optional figure title.

Returns:

The rendered figure.

Return type:

matplotlib.figure.Figure

Example

>>> result = bench.bayesian_comparison()
>>> fig = result.plot()

class evaluma.results.FrequentistResult(table: pandas.DataFrame, avg_ranks: pandas.Series, friedman_statistic: float, friedman_p_value: float, reference=None, alpha=0.05, cd=None)

Result of frequentist_comparison().

table
avg_ranks
friedman_statistic
friedman_p_value
reference = None
alpha = 0.05
cd = None

plot(title=None)

Render the comparison result.

In all-pairs mode, renders a critical difference (CD) diagram (Demšar 2006) with the Nemenyi critical difference bracket. In reference mode, renders a horizontal bar chart of Holm-corrected p-values.

Parameters:

title – Optional figure title.

Returns:

The rendered figure.

Return type:

matplotlib.figure.Figure
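As background for the reference-mode chart, the following is a minimal sketch of the Holm step-down correction for a family of p-values. `holm_adjust` is a hypothetical helper written for illustration; how evaluma computes the raw p-values, and whether it applies the correction in exactly this form, is not stated in this reference.

```python
def holm_adjust(p_values):
    """Holm step-down adjusted p-values, returned in the input order.

    Hypothetical helper for illustration, not part of evaluma.
    """
    m = len(p_values)
    # Indices of the p-values from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The smallest p-value is multiplied by m, the next by m - 1, ...
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Enforce monotonicity so adjusted values never decrease.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


raw = [0.01, 0.04, 0.03]  # one p-value per model-vs-reference comparison
print(holm_adjust(raw))
```

A comparison is then declared significant when its adjusted p-value is below the chosen alpha.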

Example

>>> result = bench.frequentist_comparison()
>>> fig = result.plot()
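To make the quantities behind the CD diagram concrete, here is a sketch of average ranks, the Friedman chi-square statistic, and the Nemenyi critical difference as defined in Demšar (2006). The function names and the `Q_ALPHA_005` table are assumptions introduced for illustration, not evaluma API. The statistic is the uncorrected textbook form, so it can differ from implementations that apply a tie correction.

```python
import math

# Nemenyi critical values q_0.05, keyed by number of models k (Demšar 2006).
Q_ALPHA_005 = {2: 1.960, 3: 2.343, 4: 2.569, 5: 2.728, 6: 2.850,
               7: 2.949, 8: 3.031, 9: 3.102, 10: 3.164}


def average_ranks(scores):
    """Mean rank per model; scores[d][m] is higher-is-better, rank 1 is best."""
    k = len(scores[0])
    totals = [0.0] * k
    for row in scores:
        order = sorted(range(k), key=lambda m: -row[m])
        pos = 0
        while pos < k:
            end = pos
            # Extend the block while the next model is tied with this one.
            while end + 1 < k and row[order[end + 1]] == row[order[pos]]:
                end += 1
            shared = (pos + end) / 2 + 1  # mean of the tied 1-based ranks
            for j in range(pos, end + 1):
                totals[order[j]] += shared
            pos = end + 1
    return [t / len(scores) for t in totals]


def friedman_chi2(avg_ranks, n_datasets):
    """Friedman chi-square statistic without tie correction."""
    k = len(avg_ranks)
    spread = sum(r * r for r in avg_ranks) - k * (k + 1) ** 2 / 4
    return 12 * n_datasets / (k * (k + 1)) * spread


def nemenyi_cd(k, n_datasets, q_alpha):
    """Critical difference: two models differ if their mean ranks differ by more."""
    return q_alpha * math.sqrt(k * (k + 1) / (6 * n_datasets))


scores = [[0.90, 0.80, 0.70],
          [0.85, 0.80, 0.60],
          [0.70, 0.75, 0.50],
          [0.95, 0.90, 0.90]]
ranks = average_ranks(scores)
print(ranks, friedman_chi2(ranks, len(scores)))
print(nemenyi_cd(3, len(scores), Q_ALPHA_005[3]))
```

In a CD diagram, models whose average ranks lie within one critical difference of each other are joined by a bar to mark them as not significantly different.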

class evaluma.results.ProfileResult(table: pandas.DataFrame)

Result of performance_profiles().

table

property aup: pandas.Series

Area Under the Profile in log₁₀(τ) space (left Riemann sum).

For each model, integrates the step-function profile curve over log₁₀(τ), using consecutive τ breakpoints as the grid:

AUP = Σ (log₁₀(τ_{i+1}) − log₁₀(τ_i)) · ρ(τ_i)

AUP is unnormalized: its scale depends on τ_max (the worst-case ratio in this run). It is meaningful for within-benchmark comparison but not across benchmarks with different τ_max values.

References

  • Roberts et al. (2022). AutoML Decathlon.

  • Dahl et al. (2023). AlgoPerf. arXiv:2306.07179.

  • Batra et al. (2025). ML-GYM. arXiv:2502.14499.

Returns:

pandas.Series indexed by model name, with values ≥ 0.
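The left Riemann sum in the AUP formula can be sketched in a few lines of plain Python. `area_under_profile` is a hypothetical stand-in for the property, and the breakpoints and profile values are invented purely to show the arithmetic and the dependence on τ_max.

```python
import math


def area_under_profile(taus, rhos):
    """Left Riemann sum of a step profile over log10(tau).

    taus: increasing breakpoints with tau >= 1.
    rhos: rhos[i] is the profile value rho(taus[i]).
    Hypothetical helper for illustration, not part of evaluma.
    """
    return sum(
        (math.log10(taus[i + 1]) - math.log10(taus[i])) * rhos[i]
        for i in range(len(taus) - 1)
    )


# Each decade of tau contributes its width (1.0) times the left-hand rho.
print(area_under_profile([1, 10, 100], [0.5, 0.75, 1.0]))

# Same model, but a run whose worst-case ratio reaches 1000: the extra
# decade adds area even though the profile has already saturated at 1.0.
print(area_under_profile([1, 10, 100, 1000], [0.5, 0.75, 1.0, 1.0]))
```

The second call returns a larger value for an identical curve up to τ = 100, which is the scale dependence on τ_max described above.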

plot(figsize=None, model_colors=None, title=None, ax=None)

Render Dolan-Moré performance profile curves on a log₁₀(τ) axis.

Parameters:
  • figsize – Figure size (width, height) in inches.

  • model_colors – Dict mapping model names to colors, or a list in model order.

  • title – Optional axes title.

  • ax – Existing axes to draw into; a new figure is created if None.

Returns:

The rendered figure.

Return type:

matplotlib.figure.Figure

Example

>>> result = bench.performance_profiles()
>>> fig = result.plot(figsize=(8, 5))
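For context, a Dolan-Moré profile ρ(τ) gives, for each model, the fraction of problems on which that model is within a factor τ of the best model. The sketch below assumes positive, lower-is-better costs. `performance_profile` is a hypothetical helper, and how evaluma converts its scores into performance ratios is not specified in this reference.

```python
def performance_profile(costs):
    """Dolan-Moré performance profiles from a cost matrix.

    costs[p][s]: positive cost (lower is better) of model s on problem p.
    Returns (taus, rho), where rho[s][i] is the fraction of problems on
    which model s has a performance ratio <= taus[i].
    Hypothetical helper for illustration, not part of evaluma.
    """
    n_problems = len(costs)
    n_models = len(costs[0])
    # Performance ratio: cost relative to the best model on that problem.
    ratios = [[c / min(row) for c in row] for row in costs]
    # The step function only changes at observed ratio values.
    taus = sorted({r for row in ratios for r in row})
    rho = [
        [sum(1 for row in ratios if row[s] <= tau) / n_problems for tau in taus]
        for s in range(n_models)
    ]
    return taus, rho


costs = [[1.0, 2.0],   # model 0 is best on problem 0
         [4.0, 2.0],   # model 1 is best on problem 1
         [3.0, 3.0]]   # tie on problem 2
taus, rho = performance_profile(costs)
print(taus)
print(rho)
```

The value of each curve at τ = 1 is the share of problems on which the model is best or tied for best, and every curve reaches 1.0 at the largest observed ratio.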