evaluma

ML benchmark ranking tools: interquartile mean (IQM), Bayesian pairwise comparison, and Dolan-Moré performance profiles in a single Python API and CLI.

IQM Ranking

Bootstrap confidence intervals around the interquartile mean, following best practices from deep RL research.
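
To make the idea concrete, here is a minimal standard-library sketch of an IQM with a percentile-bootstrap confidence interval. It is not evaluma's API: the names `iqm` and `bootstrap_iqm_ci`, the default of 2000 resamples, and the simple floor-based trimming rule are all assumptions made for illustration.

```python
import random
import statistics


def iqm(scores):
    """Interquartile mean: the mean of the middle 50% of the sorted scores.

    Assumption: floor(n / 4) values are dropped from each tail, so the
    trimming is only exactly 25% when n is divisible by 4.
    """
    ordered = sorted(scores)
    n = len(ordered)
    cut = n // 4
    return statistics.fmean(ordered[cut:n - cut])


def bootstrap_iqm_ci(scores, n_resamples=2000, confidence=0.95, seed=0):
    """Percentile-bootstrap confidence interval around the IQM (hypothetical helper)."""
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(scores)
    # Resample with replacement, recompute the IQM each time, then sort.
    estimates = sorted(iqm(rng.choices(scores, k=n)) for _ in range(n_resamples))
    alpha = (1.0 - confidence) / 2.0
    lower = estimates[int(alpha * n_resamples)]
    upper = estimates[min(n_resamples - 1, int((1.0 - alpha) * n_resamples))]
    return lower, upper


scores = [1, 2, 3, 4, 5, 6, 7, 100]  # one extreme outlier
point = iqm(scores)                  # the outlier falls in the trimmed tail
lower, upper = bootstrap_iqm_ci(scores)
print(f"IQM = {point}, 95% CI = [{lower}, {upper}]")
```

The plain mean of these scores is 16.0, while the IQM is 4.5, which shows why the IQM is less sensitive to a single outlying run.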

Bayesian Comparison

Posterior probabilities that one model beats another (or is practically equivalent) across multiple datasets.
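
One common way to obtain such probabilities is a Bayesian sign test with a region of practical equivalence (ROPE). The sketch below illustrates that technique under stated assumptions and is not necessarily the method evaluma implements: the function name `bayesian_sign_test`, the ROPE width of 0.01, and the symmetric Dirichlet prior with pseudo-count 1/3 per region are all hypothetical choices.

```python
import random


def bayesian_sign_test(scores_a, scores_b, rope=0.01, prior=1.0 / 3.0,
                       n_samples=20000, seed=0):
    """Return (P(A better), P(practically equivalent), P(B better)).

    Each dataset's score difference is assigned to one of three regions.
    The region counts plus a prior pseudo-count define a Dirichlet posterior,
    and each reported probability is the share of posterior draws in which
    that region has the largest weight.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("scores must be paired, one entry per dataset")

    # Order of regions: A better, within ROPE, B better.
    alphas = [prior, prior, prior]
    for a, b in zip(scores_a, scores_b):
        diff = a - b
        if diff > rope:
            alphas[0] += 1
        elif diff < -rope:
            alphas[2] += 1
        else:
            alphas[1] += 1

    rng = random.Random(seed)
    wins = [0, 0, 0]
    for _ in range(n_samples):
        # Independent Gamma draws are an unnormalised Dirichlet sample;
        # normalising would not change which component is largest.
        draws = [rng.gammavariate(alpha, 1.0) for alpha in alphas]
        wins[draws.index(max(draws))] += 1
    return tuple(w / n_samples for w in wins)


# Hypothetical accuracies of two models on ten datasets.
model_a = [0.91, 0.88, 0.93, 0.90, 0.87, 0.92, 0.89, 0.94, 0.90, 0.91]
model_b = [0.85, 0.84, 0.88, 0.86, 0.80, 0.87, 0.85, 0.89, 0.86, 0.84]
p_a, p_rope, p_b = bayesian_sign_test(model_a, model_b)
print(f"P(A > B) = {p_a:.3f}, P(equivalent) = {p_rope:.3f}, P(B > A) = {p_b:.3f}")
```

The ROPE width should be chosen in the units of the metric: differences smaller than it are treated as practically irrelevant.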

Performance Profiles

Dolan-Moré cumulative profiles showing the fraction of benchmarks on which a model's performance is within a given ratio of the best model's.
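
As a rough illustration, a Dolan-Moré profile can be computed in a few lines. The sketch assumes a cost-style metric where lower is better (for example error rate or runtime) and strictly positive values; the function name `performance_profile` and the input layout are hypothetical, not evaluma's API.

```python
def performance_profile(costs, taus):
    """Compute Dolan-Moré performance profiles.

    costs: dict mapping model name -> list of positive costs, one per
           benchmark, in the same benchmark order for every model.
    taus:  performance ratios at which to evaluate the profile.
    Returns a dict mapping model name -> list of fractions, one per tau.
    """
    models = list(costs)
    n_benchmarks = len(costs[models[0]])
    if any(len(costs[m]) != n_benchmarks for m in models):
        raise ValueError("every model needs one cost per benchmark")

    # Best (lowest) cost achieved by any model on each benchmark.
    best = [min(costs[m][p] for m in models) for p in range(n_benchmarks)]

    profile = {}
    for m in models:
        # Performance ratio: this model's cost relative to the best cost.
        ratios = [costs[m][p] / best[p] for p in range(n_benchmarks)]
        # Fraction of benchmarks whose ratio is within each tau.
        profile[m] = [sum(r <= tau for r in ratios) / n_benchmarks for tau in taus]
    return profile


# Hypothetical costs of two models on three benchmarks.
costs = {"A": [1.0, 2.0, 4.0], "B": [2.0, 2.0, 1.0]}
profile = performance_profile(costs, taus=[1.0, 2.0, 4.0])
for name, fractions in profile.items():
    print(name, fractions)
```

The value at a ratio of 1 is the fraction of benchmarks on which the model is the best (or tied for best); the curve reaches 1.0 once the ratio is large enough to cover the model's worst benchmark.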

Quick links

Overview & Quickstart — Install and run your first analysis
Configuration Guide — Seeds, metric bounds, and incomplete data
Tutorials — Deep dives into each method
API Reference — Full module documentation
Contributing — Set up a dev environment
