evaluma.cli

Functions

_parse_metric_direction(ctx, param, value)

Parse KEY:min / KEY:max tokens into a metric-direction dict.

_common_options(f)

Attach shared CLI options to a Click command.

_load_bench(csv_path, model, dataset, metric, score, ...)

Load a CSV and return a normalized Benchmark, merging CLI args with config.

_save(result, stem, output_dir)

Serialize a result to CSV and PNG inside output_dir.

main()

evaluma — ML benchmark evaluation tools.

report(csv_path, model, dataset, metric, score, ...)

Run all three analyses and write results to --output.

rank(csv_path, model, dataset, metric, score, ...)

Compute interquartile-mean (IQM) rankings (requires a seed column) and write iqm_ranking.{csv,png}.

aggregate(csv_path, model, dataset, metric, score, ...)

Compute a point-estimate aggregate ranking and write aggregate_ranking.{csv,png}.

compare(csv_path, model, dataset, metric, score, ...)

Compute Bayesian pairwise comparisons and write results.

frequentist(csv_path, model, dataset, metric, score, ...)

Run a frequentist comparison: Friedman test with Nemenyi post-hoc (all pairs), or Wilcoxon tests with Holm correction (against a reference).

profiles(csv_path, model, dataset, metric, score, ...)

Compute Dolan-Moré performance profiles and write results.

Module Contents

evaluma.cli._parse_metric_direction(ctx, param, value)

Parse KEY:min / KEY:max tokens into a metric-direction dict.

Parameters:
  • ctx – Click context (unused; required by the callback protocol).

  • param – Click parameter (unused).

  • value – Tuple of strings, each formatted as "KEY:min" or "KEY:max".

Returns:

Mapping from each KEY to "min" or "max", or None when value is empty.

Return type:

dict | None

Raises:

click.BadParameter – If a token is malformed or the direction is not "min" or "max".
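The parsing described above can be sketched as follows. This is a minimal illustration, not evaluma's actual code: the name parse_metric_direction is hypothetical, the Click ctx and param arguments are omitted, and ValueError stands in for click.BadParameter so the sketch runs without Click installed.

```python
def parse_metric_direction(value):
    """Turn ("KEY:min", "KEY:max", ...) into {"KEY": "min" | "max"}, or None if empty."""
    if not value:
        return None
    directions = {}
    for token in value:
        # Split on the last colon so a key containing ":" still parses (an assumption).
        key, sep, direction = token.rpartition(":")
        if not sep or not key:
            # A real Click callback would raise click.BadParameter here.
            raise ValueError(f"expected KEY:min or KEY:max, got {token!r}")
        if direction not in ("min", "max"):
            raise ValueError(f"direction must be 'min' or 'max', got {direction!r}")
        directions[key] = direction
    return directions


parsed = parse_metric_direction(("rmse:min", "accuracy:max"))
```

Here parsed maps "rmse" to "min" and "accuracy" to "max"; the key names are made up for the example.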

evaluma.cli._common_options(f)

Attach shared CLI options to a Click command.
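One common way to share options across commands is a decorator that applies a fixed list of option decorators. The sketch below shows only that pattern and makes no claim about how evaluma implements it: option is a hypothetical stand-in for click.option that merely records a flag name, and the flag names themselves are assumptions.

```python
def option(flag):
    """Hypothetical stand-in for click.option: records the flag on the function."""
    def decorator(f):
        # Decorators run bottom-up, so prepend to keep declaration order.
        f.options = [flag] + getattr(f, "options", [])
        return f
    return decorator


# Hypothetical flag names, chosen only for this example.
SHARED_OPTIONS = [
    option("--model"),
    option("--dataset"),
    option("--metric"),
    option("--score"),
]


def common_options(f):
    """Apply every shared option decorator to f, as if each were stacked by hand."""
    for decorator in reversed(SHARED_OPTIONS):
        f = decorator(f)
    return f


@common_options
def rank(**kwargs):
    return kwargs
```

Applying the list in reverse mirrors how stacked decorators are evaluated, so the options end up in the order they are declared.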

evaluma.cli._load_bench(csv_path, model, dataset, metric, score, config_path, metric_direction, output_dir, seed=None)

Load a CSV and return a normalized Benchmark, merging CLI args with config.

Parameters:
  • csv_path – Path to the input CSV file.

  • model – CLI value for the model column name.

  • dataset – CLI value for the dataset column name.

  • metric – CLI value for the metric column name.

  • score – CLI value for the score column name.

  • config_path – Optional path to a YAML config file.

  • metric_direction – Parsed metric-direction dict (or None).

  • output_dir – Path to the output directory (created if absent).

  • seed – Optional name of the column that holds the random seed.

Returns:

Loaded and normalized benchmark.

Return type:

Benchmark
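The merge of CLI arguments with config values might look like the sketch below. The precedence rule (a CLI value that is not None overrides the config) is an assumption, since the page does not state one; merge_columns and the sample column names are hypothetical, and the config is shown as an already-loaded dict rather than a YAML file.

```python
def merge_columns(cli, config):
    """Combine column settings, letting explicit CLI values win (assumed precedence)."""
    merged = dict(config)
    # Options the user did not pass arrive as None and must not clobber the config.
    merged.update({key: value for key, value in cli.items() if value is not None})
    return merged


config = {"model": "algo", "dataset": "task", "score": "value"}
cli = {"model": None, "dataset": "benchmark", "metric": "metric_name", "score": None}
columns = merge_columns(cli, config)
```

In this example the config supplies model and score, while the CLI supplies dataset and metric.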

evaluma.cli._save(result, stem, output_dir)

Serialize a result to CSV and PNG inside output_dir.
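As an illustration of how a stem and an output directory could map to the two files, the sketch below derives both paths. It assumes the files are named stem.csv and stem.png, which matches names such as iqm_ranking.{csv,png} on this page but is not confirmed for every command; output_paths is a hypothetical helper, and the actual serialization is left out.

```python
from pathlib import Path


def output_paths(stem, output_dir):
    """Return the CSV and PNG paths for a result, both inside output_dir."""
    out = Path(output_dir)
    return out / f"{stem}.csv", out / f"{stem}.png"


csv_file, png_file = output_paths("iqm_ranking", "results")
```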

evaluma.cli.main()

evaluma — ML benchmark evaluation tools.

evaluma.cli.report(csv_path, model, dataset, metric, score, config_path, metric_direction, output_dir)

Run all three analyses and write results to --output.

evaluma.cli.rank(csv_path, model, dataset, metric, score, config_path, metric_direction, output_dir, seed)

Compute interquartile-mean (IQM) rankings (requires a seed column) and write iqm_ranking.{csv,png}.
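For readers unfamiliar with the statistic, the interquartile mean averages the middle half of the sorted scores, which makes it less sensitive to outlier runs than a plain mean. The sketch below is a simplified version that drops whole observations from each tail; it is not necessarily how evaluma computes IQM, and interquartile_mean is a hypothetical name.

```python
from statistics import mean


def interquartile_mean(scores):
    """Mean of the scores left after dropping the lowest and highest quarter."""
    ordered = sorted(scores)
    # Integer trimming: with fewer than 4 scores nothing is dropped.
    k = len(ordered) // 4
    return mean(ordered[k:len(ordered) - k])


per_seed_scores = [1, 2, 3, 4, 5, 6, 7, 100]
iqm = interquartile_mean(per_seed_scores)  # averages 3, 4, 5, 6
```

The outlier 100 falls in the trimmed top quarter, so it does not affect the result.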

evaluma.cli.aggregate(csv_path, model, dataset, metric, score, config_path, metric_direction, output_dir, agg)

Compute a point-estimate aggregate ranking and write aggregate_ranking.{csv,png}.
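A point-estimate ranking can be sketched as collapsing each model's scores to a single number and sorting. The accepted values of agg are not documented on this page, so the "mean" and "median" choices below are assumptions, as are the function name and the sample data.

```python
from statistics import mean, median

# Assumed aggregator names; the real agg option may accept different values.
AGGREGATORS = {"mean": mean, "median": median}


def aggregate_ranking(scores, agg="mean", higher_is_better=True):
    """Rank models by one aggregate score each; rank 1 is the best model."""
    summary = {model: AGGREGATORS[agg](values) for model, values in scores.items()}
    ordered = sorted(summary, key=summary.get, reverse=higher_is_better)
    return [(rank, model, summary[model]) for rank, model in enumerate(ordered, start=1)]


scores = {"A": [0.9, 0.5, 0.7], "B": [0.8, 0.8, 0.6], "C": [0.4, 0.5, 0.3]}
ranking = aggregate_ranking(scores, agg="median")
```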

evaluma.cli.compare(csv_path, model, dataset, metric, score, config_path, metric_direction, output_dir)

Compute Bayesian pairwise comparisons and write results.

evaluma.cli.frequentist(csv_path, model, dataset, metric, score, config_path, metric_direction, output_dir, reference, alpha)

Run a frequentist comparison: Friedman test with Nemenyi post-hoc (all pairs), or Wilcoxon tests with Holm correction (against a reference).
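The Holm step used when comparing several models against a reference can be sketched in a few lines. This shows the standard Holm step-down adjustment of raw p-values, not evaluma's implementation; holm_adjust is a hypothetical name, and the p-values are invented for the example.

```python
def holm_adjust(p_values):
    """Return Holm-adjusted p-values in the same order as the input."""
    m = len(p_values)
    # Visit hypotheses from the smallest raw p-value to the largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        # The multiplier shrinks from m down to 1; cap the result at 1.
        candidate = min(1.0, (m - rank) * p_values[i])
        # Enforce monotonicity so adjusted values never decrease along the order.
        running_max = max(running_max, candidate)
        adjusted[i] = running_max
    return adjusted


adjusted = holm_adjust([0.01, 0.04, 0.03])
```

A comparison is then declared significant when its adjusted p-value is at or below the chosen alpha.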

evaluma.cli.profiles(csv_path, model, dataset, metric, score, config_path, metric_direction, output_dir)

Compute Dolan-Moré performance profiles and write results.
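A Dolan-Moré performance profile reports, for each model and each threshold tau, the fraction of problems on which that model's cost is within a factor tau of the best cost on that problem. The sketch below follows that definition under the assumptions that costs are positive and lower is better; performance_profile and the sample data are hypothetical and do not reflect evaluma's internals.

```python
def performance_profile(costs, taus):
    """Map each solver to its profile values rho(tau) for the given taus."""
    solvers = list(costs)
    n_problems = len(costs[solvers[0]])
    # Best (lowest) cost achieved by any solver on each problem.
    best = [min(costs[s][p] for s in solvers) for p in range(n_problems)]
    profile = {}
    for s in solvers:
        # Performance ratio: this solver's cost relative to the best one.
        ratios = [costs[s][p] / best[p] for p in range(n_problems)]
        profile[s] = [sum(r <= tau for r in ratios) / n_problems for tau in taus]
    return profile


costs = {"A": [1, 2, 4], "B": [2, 2, 1]}
profile = performance_profile(costs, taus=[1, 2, 4])
```

The value at tau = 1 is the share of problems on which a solver is the best (or tied for best), and a curve that reaches 1 sooner indicates a more robust solver.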