evaluma
ML benchmark ranking tools: interquartile mean, Bayesian pairwise comparison, and Dolan-Moré performance profiles in a single Python API and CLI.
IQM Ranking
Bootstrap confidence intervals around the interquartile mean, following best practices from deep RL research.
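To make the idea concrete, here is a minimal standard-library sketch of an interquartile mean with a percentile bootstrap interval. It is not evaluma's API: the function names `iqm` and `bootstrap_ci` and the parameters `n_boot`, `alpha`, and `seed` are hypothetical, and it uses a plain (unstratified) bootstrap over a flat list of scores, which may differ from what the library does.

```python
import random
from statistics import fmean


def iqm(scores):
    """Interquartile mean: drop the lowest and highest 25% of values, average the rest."""
    x = sorted(scores)
    cut = len(x) // 4  # number of values trimmed from each end
    return fmean(x[cut:len(x) - cut])


def bootstrap_ci(scores, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval around the IQM (hypothetical helper)."""
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(scores)
    # Resample with replacement and recompute the IQM each time.
    stats = sorted(iqm(rng.choices(scores, k=n)) for _ in range(n_boot))
    lo = stats[int((alpha / 2) * n_boot)]
    hi = stats[int((1 - alpha / 2) * n_boot) - 1]
    return iqm(scores), lo, hi


# The middle half of 1..8 is [3, 4, 5, 6], so the IQM is 4.5.
point = iqm([1, 2, 3, 4, 5, 6, 7, 8])
```

Trimming the outer quartiles makes the estimate less sensitive to outlier runs than a plain mean, while still using more of the data than a median.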
Bayesian Comparison
Posterior probabilities that one model beats another (or is practically equivalent) across multiple datasets.
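One common way to obtain such probabilities is a Bayesian sign test with a region of practical equivalence (ROPE). The sketch below illustrates that technique only, and whether evaluma uses it is an assumption. The function name `bayesian_sign_test`, the `rope` width, and the symmetric Dirichlet prior are all hypothetical choices made for illustration.

```python
import random


def bayesian_sign_test(scores_a, scores_b, rope=0.01, prior=1.0,
                       n_samples=20000, seed=0):
    """Return (P(A better), P(practically equivalent), P(B better)).

    Hypothetical helper: per-dataset score differences are sorted into three
    regions, and a Dirichlet posterior over the region probabilities is sampled.
    """
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    counts = [
        sum(d > rope for d in diffs),            # A better by more than the ROPE
        sum(-rope <= d <= rope for d in diffs),  # difference inside the ROPE
        sum(d < -rope for d in diffs),           # B better by more than the ROPE
    ]
    # Assumption: symmetric Dirichlet prior with `prior` pseudo-counts per region.
    alphas = [c + prior for c in counts]

    rng = random.Random(seed)
    wins = [0, 0, 0]
    for _ in range(n_samples):
        # Independent gamma draws have the same argmax as the normalized
        # Dirichlet sample, so normalization is skipped.
        g = [rng.gammavariate(a, 1.0) for a in alphas]
        wins[max(range(3), key=g.__getitem__)] += 1  # region with the largest mass
    return tuple(w / n_samples for w in wins)


# Toy data (made up): model A is ahead by 0.05 on each of ten datasets.
p_a, p_rope, p_b = bayesian_sign_test([0.80] * 10, [0.75] * 10)
```

The three returned values sum to one, so they can be read directly as "A wins", "no practical difference", and "B wins".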
Performance Profiles
Dolan-Moré cumulative profiles showing what fraction of benchmarks a model solves within a given performance ratio.
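The computation behind a Dolan-Moré profile can be sketched in a few lines. This assumes each model has a strictly positive cost per benchmark where lower is better (for example, an error rate). The function name `performance_profile` and the dict-of-lists input format are hypothetical and not evaluma's API.

```python
def performance_profile(costs, taus):
    """Dolan-Moré profile: for each model, the fraction of benchmarks whose
    cost is within a factor tau of the best model on that benchmark.

    costs: dict mapping model name -> list of positive costs, one per benchmark.
    taus:  performance ratios (each >= 1) at which to evaluate the profile.
    """
    models = list(costs)
    n = len(costs[models[0]])
    # Best (lowest) cost achieved by any model on each benchmark.
    best = [min(costs[m][p] for m in models) for p in range(n)]
    profile = {}
    for m in models:
        ratios = [costs[m][p] / best[p] for p in range(n)]
        profile[m] = [sum(r <= t for r in ratios) / n for t in taus]
    return profile


# Toy data (made up): two models, three benchmarks.
profile = performance_profile(
    {"a": [1.0, 2.0, 4.0], "b": [2.0, 2.0, 1.0]},
    taus=[1, 2, 4],
)
```

The value at a ratio of 1 is the fraction of benchmarks on which a model is best or tied for best, and a curve that reaches 1.0 at a small ratio indicates a model that is never far behind the leader.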
Quick links
Overview & Quickstart — Install and run your first analysis
Configuration Guide — Seeds, metric bounds, and incomplete data
Tutorials — Deep dives into each method
API Reference — Full module documentation
Contributing — Set up a dev environment