evaluma.plot

Functions

plot_aggregate_ranking(table, *[, figsize, ...])

Render aggregate scores as a horizontal bar chart (no CI whiskers).

plot_iqm_ranking(table, *[, figsize, model_colors, ...])

Render IQM scores as a horizontal bar chart with CI error bars.

plot_bayesian_heatmap(table, *[, title, figsize])

Render Bayesian pairwise probabilities as a matplotlib heatmap.

plot_bayesian_reference_bars(table, reference, *[, ...])

Render Bayesian comparison against a reference as stacked horizontal bars.

plot_cd_diagram(avg_ranks, cd, *[, title, figsize])

Render a Critical Difference diagram (Demšar 2006).

plot_frequentist_reference_bars(table, reference, alpha, *[, title, figsize])

Render frequentist reference-mode results as horizontal bars.

plot_performance_profiles(table, *[, figsize, ...])

Render Dolan-Moré performance profile curves.

Module Contents

evaluma.plot.plot_aggregate_ranking(table: pandas.DataFrame, *, figsize=None, model_colors=None, title=None, ax=None)

Render aggregate scores as a horizontal bar chart (no CI whiskers).

Parameters:
  • table – DataFrame with columns model and score.

  • figsize – Figure size (width, height) in inches.

  • model_colors – List of colors, one per model in row order.

  • title – Optional axes title.

  • ax – Existing axes to draw into; a new figure is created if None.

Returns:

The rendered figure.

Return type:

matplotlib.figure.Figure

evaluma.plot.plot_iqm_ranking(table: pandas.DataFrame, *, figsize=None, model_colors=None, title=None, ax=None)

Render IQM scores as a horizontal bar chart with CI error bars.

Parameters:
  • table – DataFrame with columns model, IQM, CI_low, CI_high as produced by compute_iqm().

  • figsize – Figure size (width, height) in inches.

  • model_colors – List of colors, one per model in row order.

  • title – Optional axes title.

  • ax – Existing axes to draw into; a new figure is created if None.

Returns:

The rendered figure.

Return type:

matplotlib.figure.Figure
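
As a rough illustration of the table shape described above, the sketch below computes an interquartile mean with the standard library and derives the error-bar offsets that a bar chart typically needs. It is not evaluma's implementation: the trimming rule, the helper name interquartile_mean, the model name, and the CI bounds are all assumptions made for the example, and compute_iqm() may differ in detail.

```python
from statistics import mean


def interquartile_mean(scores):
    """Mean of the middle 50% of the scores (a simple trimmed-mean variant)."""
    ordered = sorted(scores)
    k = len(ordered) // 4  # number of values trimmed from each end
    return mean(ordered[k:len(ordered) - k])


# Hypothetical per-task scores for one model.
scores = [0.1, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
iqm = interquartile_mean(scores)

# One row in the documented column layout; the CI bounds are made-up values.
row = {"model": "model-x", "IQM": iqm, "CI_low": 0.55, "CI_high": 0.75}

# Error bars are usually drawn from offsets relative to the point estimate,
# not from the absolute CI bounds.
err_low = row["IQM"] - row["CI_low"]
err_high = row["CI_high"] - row["IQM"]
print(row, err_low, err_high)
```

A list of such rows can be turned into a DataFrame with pandas.DataFrame(rows) before being passed as table.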

evaluma.plot.plot_bayesian_heatmap(table: pandas.DataFrame, *, title=None, figsize=None, **_kwargs)

Render Bayesian pairwise probabilities as a matplotlib heatmap.

Each cell (i, j) shows P(model_i > model_j).

Parameters:
  • table – DataFrame with columns model_a, model_b, p_a_better, p_equiv, p_b_better.

  • title – Optional figure title.

  • figsize – Figure size (width, height) in inches.

Returns:

The rendered figure.

Return type:

matplotlib.figure.Figure
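
To make the cell convention concrete, the following sketch pivots long-format rows into a square matrix where entry (i, j) holds P(model_i > model_j). It uses only plain Python, and the helper pairwise_matrix and the probabilities are hypothetical. How evaluma orders the models, fills the diagonal, or handles missing pairs is not stated here, so those choices are assumptions.

```python
# Hypothetical long-format rows in the documented column layout.
rows = [
    {"model_a": "A", "model_b": "B", "p_a_better": 0.70, "p_equiv": 0.20, "p_b_better": 0.10},
    {"model_a": "A", "model_b": "C", "p_a_better": 0.55, "p_equiv": 0.25, "p_b_better": 0.20},
    {"model_a": "B", "model_b": "C", "p_a_better": 0.30, "p_equiv": 0.30, "p_b_better": 0.40},
]


def pairwise_matrix(rows):
    """Return (models, matrix) with matrix[i][j] = P(models[i] > models[j])."""
    models = sorted({r["model_a"] for r in rows} | {r["model_b"] for r in rows})
    index = {name: i for i, name in enumerate(models)}
    n = len(models)
    # The diagonal stays None: a model is not compared with itself.
    matrix = [[None] * n for _ in range(n)]
    for r in rows:
        i, j = index[r["model_a"]], index[r["model_b"]]
        matrix[i][j] = r["p_a_better"]  # P(model_a > model_b)
        matrix[j][i] = r["p_b_better"]  # P(model_b > model_a)
    return models, matrix


models, matrix = pairwise_matrix(rows)
for name, line in zip(models, matrix):
    print(name, line)
```

Because p_equiv is nonzero, cells (i, j) and (j, i) need not sum to 1.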

evaluma.plot.plot_bayesian_reference_bars(table: pandas.DataFrame, reference: str, *, title=None, figsize=None)

Render Bayesian comparison against a reference as stacked horizontal bars.

Each bar represents one model compared to the reference. Blue = P(model > reference), grey = P(equivalent), red = P(reference > model). Bars are sorted by P(model > reference) descending.

Parameters:
  • table – DataFrame with columns model_a, model_b, p_a_better, p_equiv, p_b_better. Expects model_a == reference for all rows (as produced by compute_bayesian() in reference mode).

  • reference – Name of the reference model.

  • title – Optional figure title.

  • figsize – Figure size (width, height) in inches.

Returns:

The rendered figure.

Return type:

matplotlib.figure.Figure
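
The sketch below shows one plausible reading of how the three bar segments relate to the table columns when model_a is the reference. Under that layout, P(model > reference) would be p_b_better and P(reference > model) would be p_a_better. This mapping is inferred from the parameter description rather than confirmed by the source, and the helper bar_segments and the data are hypothetical.

```python
reference = "baseline"

# Hypothetical reference-mode rows: model_a is always the reference.
rows = [
    {"model_a": "baseline", "model_b": "m1", "p_a_better": 0.10, "p_equiv": 0.30, "p_b_better": 0.60},
    {"model_a": "baseline", "model_b": "m2", "p_a_better": 0.50, "p_equiv": 0.30, "p_b_better": 0.20},
    {"model_a": "baseline", "model_b": "m3", "p_a_better": 0.05, "p_equiv": 0.05, "p_b_better": 0.90},
]


def bar_segments(rows, reference):
    """Map each row to blue/grey/red segment widths, sorted by blue descending."""
    segments = []
    for r in rows:
        if r["model_a"] != reference:
            raise ValueError("expected model_a to be the reference in every row")
        segments.append({
            "model": r["model_b"],
            "blue": r["p_b_better"],  # assumed: P(model > reference)
            "grey": r["p_equiv"],     # P(equivalent)
            "red": r["p_a_better"],   # assumed: P(reference > model)
        })
    return sorted(segments, key=lambda s: s["blue"], reverse=True)


bars = bar_segments(rows, reference)
for b in bars:
    print(b)
```

Each bar's three segments sum to 1, which is what makes the stacked layout readable as a probability split.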

evaluma.plot.plot_cd_diagram(avg_ranks: pandas.Series, cd: float, *, title=None, figsize=None)

Render a Critical Difference diagram (Demšar 2006).

Models are placed on a horizontal axis by average rank (rank 1 = best on the left). Thick horizontal bars connect cliques of models whose rank gap does not exceed the Nemenyi CD scalar. A CD bracket in the top-right corner shows the critical difference visually.

Parameters:
  • avg_ranks – Series mapping model names to average rank (lower = better), as produced by compute_frequentist().

  • cd – Nemenyi critical difference scalar.

  • title – Optional axes title.

  • figsize – Figure size (width, height) in inches.

Returns:

The rendered figure.

Return type:

matplotlib.figure.Figure
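
For readers unfamiliar with the two inputs, this sketch computes a Nemenyi critical difference with the formula from Demšar (2006), CD = q_alpha * sqrt(k(k+1) / (6N)), and then groups models into the cliques that a CD diagram would connect. It is an illustration under stated assumptions, not evaluma's code: the helper names, the average ranks, and the dataset count are hypothetical, and compute_frequentist() is the documented source of avg_ranks.

```python
import math

# Critical values q_0.05 for the two-tailed Nemenyi test (Demsar 2006),
# keyed by the number of compared models k.
Q_ALPHA_005 = {2: 1.960, 3: 2.343, 4: 2.569, 5: 2.728, 6: 2.850}


def nemenyi_cd(k, n_datasets, q_alpha):
    """Critical difference for k models evaluated on n_datasets datasets."""
    return q_alpha * math.sqrt(k * (k + 1) / (6.0 * n_datasets))


def cliques(avg_ranks, cd):
    """Maximal runs of rank-sorted models whose rank gap does not exceed cd."""
    ordered = sorted(avg_ranks.items(), key=lambda kv: kv[1])
    groups = []
    last_end = -1
    for start in range(len(ordered)):
        end = start
        while end + 1 < len(ordered) and ordered[end + 1][1] - ordered[start][1] <= cd:
            end += 1
        # Keep groups of two or more models that are not inside an earlier group.
        if end > start and end > last_end:
            groups.append([name for name, _ in ordered[start:end + 1]])
            last_end = end
    return groups


# Hypothetical average ranks over 10 datasets.
avg_ranks = {"A": 1.5, "B": 2.0, "C": 3.2, "D": 3.3}
cd = nemenyi_cd(k=len(avg_ranks), n_datasets=10, q_alpha=Q_ALPHA_005[len(avg_ranks)])
groups = cliques(avg_ranks, cd)
print(round(cd, 4), groups)
```

Here A and B form one clique and B, C, and D form another, so only models separated by more than the CD (for example A and C) would be shown as significantly different.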

evaluma.plot.plot_frequentist_reference_bars(table: pandas.DataFrame, reference: str, alpha: float, *, title=None, figsize=None)

Render frequentist reference-mode results as horizontal bars.

Each bar shows the Holm-corrected p-value for a model vs the reference. A vertical dashed line marks the significance threshold.

Parameters:
  • table – DataFrame with columns model_a, model_b, p_value_corrected, significant, as produced by compute_frequentist() in reference mode.

  • reference – Name of the reference model.

  • alpha – Significance threshold; used to position the dashed line.

  • title – Optional figure title.

  • figsize – Figure size (width, height) in inches.

Returns:

The rendered figure.

Return type:

matplotlib.figure.Figure
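
As background on what the bars represent, the sketch below applies the standard Holm step-down correction to a few raw p-values and builds rows in the documented column layout. It is a generic illustration and not evaluma's code: the helper holm_correct, the raw p-values, the model names, and the choice to put the reference in model_a are assumptions. In practice the table would come from compute_frequentist().

```python
def holm_correct(p_values):
    """Holm step-down adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        # The smallest p-value is multiplied by m, the next by m - 1, and so on.
        candidate = min(1.0, (m - rank) * p_values[i])
        # Enforce monotonicity so adjusted values never decrease along the order.
        running_max = max(running_max, candidate)
        adjusted[i] = running_max
    return adjusted


alpha = 0.05
raw = {"m1": 0.01, "m2": 0.04, "m3": 0.03}  # hypothetical raw p-values
corrected = dict(zip(raw, holm_correct(list(raw.values()))))

rows = [
    {
        "model_a": "baseline",  # assumed orientation: reference in model_a
        "model_b": name,
        "p_value_corrected": p,
        "significant": p < alpha,
    }
    for name, p in corrected.items()
]
for r in rows:
    print(r)
```

With these inputs only m1 stays below alpha after correction, so its bar would be the only one ending to the left of the dashed threshold line.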

evaluma.plot.plot_performance_profiles(table: pandas.DataFrame, *, figsize=None, model_colors=None, title=None, ax=None)

Render Dolan-Moré performance profile curves.

The x-axis uses a native log₁₀ scale with raw τ ratio values (1, 2, 5, 10…), following ML-GYM (Batra et al., 2025) and the AutoML Decathlon (Roberts et al., 2022). τ = 1 means tied for best; τ = 10 means 10× worse than the best model.

Parameters:
  • table – Long-format DataFrame with columns tau, model, fraction_within_tau.

  • figsize – Figure size (width, height) in inches.

  • model_colors – Dict mapping model names to colors, or a list in model order.

  • title – Optional axes title.

  • ax – Existing axes to draw into; a new figure is created if None.

Returns:

The rendered figure.

Return type:

matplotlib.figure.Figure
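
To show where the tau, model, and fraction_within_tau columns could come from, this sketch computes Dolan-Moré performance profiles from per-task costs using plain Python. It assumes positive, lower-is-better costs and a hand-picked tau grid. The helper performance_profile and the data are hypothetical, and evaluma's own pipeline for producing this table may differ.

```python
def performance_profile(costs, taus):
    """Long-format rows giving, per model, the fraction of tasks within tau of the best."""
    models = list(costs)
    n_tasks = len(next(iter(costs.values())))
    # Best (lowest) cost achieved by any model on each task.
    best = [min(costs[m][t] for m in models) for t in range(n_tasks)]
    rows = []
    for m in models:
        # Performance ratio: 1 means tied for best on that task.
        ratios = [costs[m][t] / best[t] for t in range(n_tasks)]
        for tau in taus:
            fraction = sum(r <= tau for r in ratios) / n_tasks
            rows.append({"tau": tau, "model": m, "fraction_within_tau": fraction})
    return rows


# Hypothetical costs on three tasks (lower is better).
costs = {"A": [1.0, 2.0, 4.0], "B": [2.0, 2.0, 1.0]}
rows = performance_profile(costs, taus=[1, 2, 5])
profile = {(r["model"], r["tau"]): r["fraction_within_tau"] for r in rows}
for r in rows:
    print(r)
```

Each curve is non-decreasing in tau and reaches 1.0 once tau covers the model's worst ratio, which is why a log-scaled x-axis is a natural fit.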