evaluma.plot

Functions

`plot_aggregate_ranking`(table, *[, figsize, ...])	Render aggregate scores as a horizontal bar chart (no CI whiskers).
`plot_iqm_ranking`(table, *[, figsize, model_colors, ...])	Render IQM scores as a horizontal bar chart with CI error bars.
`plot_bayesian_heatmap`(table, *[, title, figsize])	Render Bayesian pairwise probabilities as a matplotlib heatmap.
`plot_bayesian_reference_bars`(table, reference, *[, ...])	Render Bayesian comparison against a reference as stacked horizontal bars.
`plot_cd_diagram`(avg_ranks, cd, *[, title, figsize])	Render a Critical Difference diagram (Demšar 2006).
`plot_frequentist_reference_bars`(table, reference, alpha, *)	Render frequentist reference-mode results as horizontal bars.
`plot_performance_profiles`(table, *[, figsize, ...])	Render Dolan-Moré performance profile curves.

Module Contents

evaluma.plot.plot_aggregate_ranking(table: pandas.DataFrame, *, figsize=None, model_colors=None, title=None, ax=None)

Render aggregate scores as a horizontal bar chart (no CI whiskers).

Parameters:

table – DataFrame with columns model and score.
figsize – Figure size (width, height) in inches.
model_colors – List of colors, one per model in row order.
title – Optional axes title.
ax – Existing axes to draw into; a new figure is created if None.

Returns:

The rendered figure.

Return type:

matplotlib.figure.Figure

evaluma.plot.plot_iqm_ranking(table: pandas.DataFrame, *, figsize=None, model_colors=None, title=None, ax=None)

Render IQM scores as a horizontal bar chart with CI error bars.

Parameters:

table – DataFrame with columns model, IQM, CI_low, CI_high as produced by compute_iqm().
figsize – Figure size (width, height) in inches.
model_colors – List of colors, one per model in row order.
title – Optional axes title.
ax – Existing axes to draw into; a new figure is created if None.

Returns:

The rendered figure.

Return type:

matplotlib.figure.Figure
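The input shape for plot_iqm_ranking can be illustrated with a minimal standard-library sketch. It computes an interquartile mean (the mean of the middle 50% of scores) and a percentile bootstrap interval, then assembles rows with the documented columns model, IQM, CI_low, CI_high. The helper names iqm and bootstrap_ci, the sample scores, and the choice of a percentile bootstrap are assumptions made for illustration; compute_iqm() may derive its interval differently.

```python
import random
from statistics import mean


def iqm(scores):
    """Interquartile mean: average of the middle 50% of the sorted scores."""
    ordered = sorted(scores)
    cut = int(0.25 * len(ordered))  # number of values trimmed from each end
    return mean(ordered[cut:len(ordered) - cut])


def bootstrap_ci(scores, n_resamples=2000, level=0.95, seed=0):
    """Percentile bootstrap interval for the IQM (an assumed CI method)."""
    rng = random.Random(seed)
    stats = sorted(
        iqm(rng.choices(scores, k=len(scores))) for _ in range(n_resamples)
    )
    lo_idx = int((1 - level) / 2 * n_resamples)
    hi_idx = int((1 + level) / 2 * n_resamples) - 1
    return stats[lo_idx], stats[hi_idx]


# Hypothetical per-task scores for two models.
scores_by_model = {
    "model_a": [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 100.0],
    "model_b": [1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5],
}

# One row per model, using the column names listed for `table`.
rows = []
for model, scores in scores_by_model.items():
    ci_low, ci_high = bootstrap_ci(scores)
    rows.append(
        {"model": model, "IQM": iqm(scores), "CI_low": ci_low, "CI_high": ci_high}
    )
```

Passing rows to pandas.DataFrame(rows) would give a frame with the four expected columns. Note that the outlier score of 100.0 does not move the IQM of model_a, which is the property that motivates the statistic.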

evaluma.plot.plot_bayesian_heatmap(table: pandas.DataFrame, *, title=None, figsize=None, **_kwargs)

Render Bayesian pairwise probabilities as a matplotlib heatmap.

Each cell (i, j) shows P(model_i > model_j).

Parameters:

table – DataFrame with columns model_a, model_b, p_a_better, p_equiv, p_b_better.
title – Optional figure title.
figsize – Figure size (width, height) in inches.

Returns:

The rendered figure.

Return type:

matplotlib.figure.Figure
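How a long-format table maps onto heatmap cells can be sketched without any plotting code. The example below assumes each unordered pair appears once in the table and that the reverse cell is filled from p_b_better; the function name pairwise_matrix and the sample probabilities are hypothetical, and the library's own pivoting logic may differ.

```python
# Hypothetical long-format rows with the documented columns.
rows = [
    {"model_a": "A", "model_b": "B", "p_a_better": 0.70, "p_equiv": 0.20, "p_b_better": 0.10},
    {"model_a": "A", "model_b": "C", "p_a_better": 0.55, "p_equiv": 0.30, "p_b_better": 0.15},
    {"model_a": "B", "model_b": "C", "p_a_better": 0.25, "p_equiv": 0.35, "p_b_better": 0.40},
]


def pairwise_matrix(rows):
    """Build a square matrix whose cell (i, j) holds P(model_i > model_j)."""
    models = sorted({r["model_a"] for r in rows} | {r["model_b"] for r in rows})
    index = {name: i for i, name in enumerate(models)}
    # The diagonal stays None: a model is not compared with itself.
    matrix = [[None] * len(models) for _ in models]
    for r in rows:
        i, j = index[r["model_a"]], index[r["model_b"]]
        matrix[i][j] = r["p_a_better"]  # P(model_a > model_b)
        matrix[j][i] = r["p_b_better"]  # P(model_b > model_a)
    return models, matrix


models, matrix = pairwise_matrix(rows)
```

Because p_equiv is not shown in either cell, matrix[i][j] and matrix[j][i] generally sum to less than 1.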

evaluma.plot.plot_bayesian_reference_bars(table: pandas.DataFrame, reference: str, *, title=None, figsize=None)

Render Bayesian comparison against a reference as stacked horizontal bars.

Each bar represents one model compared with the reference and is split into three segments: blue for P(model > reference), grey for P(equivalent), and red for P(reference > model). Bars are sorted by P(model > reference) in descending order.

Parameters:

table – DataFrame with columns model_a, model_b, p_a_better, p_equiv, p_b_better. Expects model_a == reference for all rows (as produced by compute_bayesian() in reference mode).
reference – Name of the reference model.
title – Optional figure title.
figsize – Figure size (width, height) in inches.

Returns:

The rendered figure.

Return type:

matplotlib.figure.Figure
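The segment layout and sort order described above can be sketched as follows. Since model_a is the reference in every row, the sketch reads P(model > reference) from p_b_better and P(reference > model) from p_a_better; that column mapping is inferred from the column names rather than confirmed by the source, and the function name stacked_segments and the sample rows are hypothetical.

```python
# Hypothetical reference-mode rows: model_a is always the reference.
rows = [
    {"model_a": "baseline", "model_b": "m1", "p_a_better": 0.50, "p_equiv": 0.30, "p_b_better": 0.20},
    {"model_a": "baseline", "model_b": "m2", "p_a_better": 0.05, "p_equiv": 0.15, "p_b_better": 0.80},
    {"model_a": "baseline", "model_b": "m3", "p_a_better": 0.30, "p_equiv": 0.25, "p_b_better": 0.45},
]


def stacked_segments(rows, reference):
    """Return one (model, blue, grey, red) tuple per bar, best model first."""
    bars = []
    for r in rows:
        if r["model_a"] != reference:
            raise ValueError("expected model_a == reference in every row")
        bars.append(
            (
                r["model_b"],
                r["p_b_better"],  # blue: P(model > reference)
                r["p_equiv"],     # grey: P(equivalent)
                r["p_a_better"],  # red: P(reference > model)
            )
        )
    # Sort by the blue segment, largest first.
    return sorted(bars, key=lambda bar: bar[1], reverse=True)


bars = stacked_segments(rows, "baseline")
```

Each tuple's three probabilities sum to 1, so every stacked bar spans the same total width.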

evaluma.plot.plot_cd_diagram(avg_ranks: pandas.Series, cd: float, *, title=None, figsize=None)

Render a Critical Difference diagram (Demšar 2006).

Models are placed on a horizontal axis by average rank (rank 1 = best on the left). Thick horizontal bars connect cliques of models whose rank gap does not exceed the Nemenyi CD scalar. A CD bracket in the top-right corner shows the critical difference visually.

Parameters:

avg_ranks – Series mapping model names to average rank (lower = better), as produced by compute_frequentist().
cd – Nemenyi critical difference scalar.
title – Optional axes title.
figsize – Figure size (width, height) in inches.

Returns:

The rendered figure.

Return type:

matplotlib.figure.Figure
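For orientation, the Nemenyi critical difference in Demšar (2006) is CD = q_α · sqrt(k(k+1) / (6N)) for k models and N datasets. The sketch below computes it and groups rank-sorted models into cliques whose extreme ranks differ by at most CD. The function names, the sample ranks, and the fixed α = 0.05 table are assumptions for illustration; compute_frequentist() may use a different α or a different grouping routine.

```python
import math

# Critical values q_alpha for the two-tailed Nemenyi test at alpha = 0.05,
# indexed by the number of models k (values as tabulated in Demšar 2006).
Q_ALPHA_005 = {2: 1.960, 3: 2.343, 4: 2.569, 5: 2.728, 6: 2.850,
               7: 2.949, 8: 3.031, 9: 3.102, 10: 3.164}


def nemenyi_cd(k, n_datasets):
    """Critical difference for k models compared over n_datasets datasets."""
    return Q_ALPHA_005[k] * math.sqrt(k * (k + 1) / (6.0 * n_datasets))


def cliques(avg_ranks, cd):
    """Maximal runs of rank-sorted models whose rank gap is at most cd."""
    ordered = sorted(avg_ranks.items(), key=lambda kv: kv[1])
    groups = []
    for start in range(len(ordered)):
        end = start
        while end + 1 < len(ordered) and ordered[end + 1][1] - ordered[start][1] <= cd:
            end += 1
        group = [name for name, _ in ordered[start:end + 1]]
        # Keep runs of two or more models that are not inside the previous run.
        if len(group) > 1 and not (groups and set(group) <= set(groups[-1])):
            groups.append(group)
    return groups


# Hypothetical average ranks for four models over ten datasets.
avg_ranks = {"A": 1.4, "B": 2.1, "C": 3.0, "D": 3.5}
cd = nemenyi_cd(k=len(avg_ranks), n_datasets=10)
groups = cliques(avg_ranks, cd)
```

With these numbers the critical difference is about 1.48, so A and B form one clique and B, C, D form another; A and D differ significantly because no bar connects them.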

evaluma.plot.plot_frequentist_reference_bars(table: pandas.DataFrame, reference: str, alpha: float, *, title=None, figsize=None)

Render frequentist reference-mode results as horizontal bars.

Each bar shows the Holm-corrected p-value for one model versus the reference. A vertical dashed line marks the significance threshold alpha.

Parameters:

table – DataFrame with columns model_a, model_b, p_value_corrected, significant, as produced by compute_frequentist() in reference mode.
reference – Name of the reference model.
alpha – Significance threshold; used to position the dashed line.
title – Optional figure title.
figsize – Figure size (width, height) in inches.

Returns:

The rendered figure.

Return type:

matplotlib.figure.Figure
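As background for the p_value_corrected and significant columns, here is a minimal sketch of the Holm step-down adjustment applied to raw p-values. The function name holm_adjust, the sample p-values, and the model names are hypothetical; how compute_frequentist() obtains its raw p-values is not covered here.

```python
def holm_adjust(p_values):
    """Holm step-down adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The smallest p-value is multiplied by m, the next by m - 1, and so on.
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Enforce monotonicity so adjusted values never decrease along the order.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


# Hypothetical raw p-values for three models compared with one reference.
reference = "baseline"
models = ["m1", "m2", "m3"]
raw = [0.01, 0.04, 0.03]
alpha = 0.05

rows = [
    {
        "model_a": reference,
        "model_b": model,
        "p_value_corrected": p_adj,
        "significant": p_adj < alpha,
    }
    for model, p_adj in zip(models, holm_adjust(raw))
]
```

Here the adjusted values are roughly 0.03, 0.06, and 0.06, so only m1 falls to the left of a dashed line drawn at alpha = 0.05.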

evaluma.plot.plot_performance_profiles(table: pandas.DataFrame, *, figsize=None, model_colors=None, title=None, ax=None)

Render Dolan-Moré performance profile curves.

The x-axis uses a native log₁₀ scale labelled with raw τ ratio values (1, 2, 5, 10, …), following ML-GYM (Batra et al., 2025) and the AutoML Decathlon (Roberts et al., 2022). τ = 1 means the model ties for best on a task; τ = 10 means it is 10× worse than the best model on that task.

Parameters:

table – Long-format DataFrame with columns tau, model, fraction_within_tau.
figsize – Figure size (width, height) in inches.
model_colors – Dict mapping model names to colors, or a list in model order.
title – Optional axes title.
ax – Existing axes to draw into; a new figure is created if None.

Returns:

The rendered figure.

Return type:

matplotlib.figure.Figure
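The long-format input can be illustrated with a minimal sketch of the Dolan-Moré construction: for each task, every model's cost is divided by the best cost on that task, and fraction_within_tau is the share of tasks whose ratio is at most τ. The sketch assumes strictly positive costs where lower is better; the function name performance_profile and the sample costs are hypothetical, and the library may convert scores to ratios differently.

```python
def performance_profile(costs, taus):
    """Return long-format rows with keys tau, model, fraction_within_tau.

    costs maps each model to a list of per-task costs (lower is better,
    strictly positive); all lists must cover the same tasks in the same order.
    """
    models = list(costs)
    n_tasks = len(costs[models[0]])
    # Best (smallest) cost achieved by any model on each task.
    best = [min(costs[m][t] for m in models) for t in range(n_tasks)]
    rows = []
    for m in models:
        ratios = [costs[m][t] / best[t] for t in range(n_tasks)]
        for tau in taus:
            fraction = sum(r <= tau for r in ratios) / n_tasks
            rows.append({"tau": tau, "model": m, "fraction_within_tau": fraction})
    return rows


# Hypothetical costs for two models on four tasks.
costs = {
    "A": [1.0, 2.0, 4.0, 1.0],
    "B": [2.0, 1.0, 1.0, 10.0],
}
rows = performance_profile(costs, taus=[1, 2, 5, 10])
profile = {(r["model"], r["tau"]): r["fraction_within_tau"] for r in rows}
```

In this example both models are best on half the tasks (fraction 0.5 at τ = 1), but A reaches 1.0 by τ = 5 while B needs τ = 10, so A's curve sits above B's on the right side of the plot.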