evaluma

ML benchmark ranking tools: interquartile mean (IQM) ranking, Bayesian pairwise comparison, and Dolan-Moré performance profiles in a single Python API and CLI.

IQM Ranking

Bootstrap confidence intervals around the interquartile mean, following best practices from deep RL research.
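
To make the idea concrete, here is a minimal NumPy sketch of an IQM point estimate with a percentile bootstrap interval. It is not evaluma's API: the function names `iqm` and `bootstrap_iqm_ci` and their parameters are hypothetical, the trimming rounds the quarter size down (exact only when the number of scores is divisible by 4), and the resampling is a plain bootstrap over a flat array of scores rather than a stratified one over runs per task.

```python
import numpy as np


def iqm(scores):
    """Interquartile mean: the mean of the middle 50% of the sorted scores."""
    x = np.sort(np.asarray(scores, dtype=float).ravel())
    cut = x.size // 4  # drop the lowest and highest quarter (rounded down)
    return float(x[cut : x.size - cut].mean())


def bootstrap_iqm_ci(scores, n_boot=2000, alpha=0.05, seed=0):
    """Return (point estimate, lower bound, upper bound) for the IQM.

    Hypothetical helper: resamples the scores with replacement and takes
    percentiles of the resampled IQMs (percentile bootstrap).
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(scores, dtype=float).ravel()
    resamples = rng.choice(x, size=(n_boot, x.size), replace=True)
    stats = np.array([iqm(row) for row in resamples])
    lower, upper = np.quantile(stats, [alpha / 2, 1 - alpha / 2])
    return iqm(x), float(lower), float(upper)
```

Because the top and bottom quarters are discarded, a single outlier run shifts the IQM far less than it shifts the mean, while the estimate still uses more of the data than the median does.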

Bayesian Comparison

Posterior probabilities that one model beats another (or is practically equivalent) across multiple datasets.
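
One common way to obtain such probabilities is a Bayesian sign test with a region of practical equivalence (ROPE). The sketch below illustrates that technique with NumPy; it is an assumption that it resembles what evaluma does internally. The function name `bayesian_sign_test`, the `rope` width, and the choice of prior (one pseudo-count on the ROPE) are all hypothetical.

```python
import numpy as np


def bayesian_sign_test(scores_a, scores_b, rope=0.01, prior=1.0,
                       n_samples=50_000, seed=0):
    """Estimate P(A better), P(equivalent), P(B better) across datasets.

    scores_a and scores_b hold one score per dataset (higher is better).
    Differences within +/- rope count as practically equivalent.
    """
    diff = np.asarray(scores_a, dtype=float) - np.asarray(scores_b, dtype=float)
    counts = np.array([
        np.sum(diff > rope),           # datasets where A wins
        np.sum(np.abs(diff) <= rope),  # datasets inside the ROPE
        np.sum(diff < -rope),          # datasets where B wins
    ], dtype=float)
    counts[1] += prior                 # assumed prior: pseudo-count on the ROPE
    alpha = counts + 1e-6              # Dirichlet parameters must be positive

    rng = np.random.default_rng(seed)
    theta = rng.dirichlet(alpha, size=n_samples)  # posterior samples
    winners = theta.argmax(axis=1)                # most probable outcome per sample
    p = np.bincount(winners, minlength=3) / n_samples
    return {"a_better": float(p[0]),
            "equivalent": float(p[1]),
            "b_better": float(p[2])}
```

The three returned values sum to 1, so they can be read directly as "A wins", "no practical difference", and "B wins". The sign test only uses the direction of each difference, not its size.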

Performance Profiles

Dolan-Moré cumulative profiles showing what fraction of benchmarks a model solves within a given performance ratio.
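
In the Dolan-Moré formulation, each model's cost on a benchmark is divided by the best cost any model achieved on that benchmark, and the profile at τ is the fraction of benchmarks whose ratio is at most τ. A minimal NumPy sketch follows; the function name `performance_profiles` is hypothetical, and it assumes strictly positive costs where lower is better (an accuracy-style score would first need converting, for example to an error rate).

```python
import numpy as np


def performance_profiles(costs, taus):
    """Dolan-Moré performance profiles.

    costs: array of shape (n_benchmarks, n_models), positive, lower is better.
    taus:  1-D array of performance-ratio thresholds (each >= 1).
    Returns an array of shape (n_taus, n_models) whose entry [i, m] is the
    fraction of benchmarks on which model m is within a factor taus[i]
    of the best model.
    """
    costs = np.asarray(costs, dtype=float)
    taus = np.asarray(taus, dtype=float)
    best = costs.min(axis=1, keepdims=True)  # best cost per benchmark
    ratios = costs / best                    # performance ratios, all >= 1
    within = ratios[None, :, :] <= taus[:, None, None]
    return within.mean(axis=1)               # average over benchmarks
```

The value at τ = 1 is the fraction of benchmarks on which a model is the best (or tied for best), and each curve is non-decreasing in τ.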