A Consequentialist Critique of Binary Classification Evaluation Practices: Theory, Practice, and Tools

Published: 03 Feb 2026 · Last Modified: 03 Feb 2026 · AISTATS 2026 Poster · CC BY 4.0
TL;DR: We survey how binary classification metrics are used at major ML venues, propose rules for choosing a metric, introduce a clipped Brier score for a common underserved setting, validate it on breast cancer data, and provide a package for this analysis.
Abstract: Machine learning–supported decisions, such as ordering diagnostic tests or determining preventive custody, often rely on binary classification from probabilistic forecasts. A consequentialist perspective, long emphasized in decision theory, favors evaluation methods that reflect the quality of such forecasts under threshold uncertainty and varying prevalence, notably the Brier score and log loss. However, our empirical review of practices at major ML venues (ICML, FAccT, CHIL) reveals a dominant reliance on top-K metrics or fixed-threshold evaluations. To address this disconnect, we introduce a decision-theoretic framework mapping evaluation metrics to their appropriate use cases, along with a practical Python package, \texttt{briertools}, designed to make proper scoring rules more usable in real-world settings. In particular, we implement a clipped variant of the Brier score that restricts the score's implicit integration over decision thresholds to a bounded, interpretable range rather than the full unit interval. We further contribute a theoretical reconciliation between the Brier score and decision curve analysis, directly addressing a longstanding critique by Assel et al. (2017) regarding the clinical utility of proper scoring rules.
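For intuition, a clipped score of this kind admits a simple closed form: by the Schervish representation, the Brier score is an integral of cost-weighted misclassification losses over all thresholds, and restricting that integral to a bounded range [t_min, t_max] is equivalent, up to a per-label constant, to clipping the forecast into that range before taking the squared error. The sketch below illustrates this construction; the function name clipped_brier, the default range, and the zero-at-best normalization are our own illustration, not necessarily the \texttt{briertools} API.

    # Minimal sketch of a clipped Brier score: integrate the cost-weighted
    # loss only over thresholds in [t_min, t_max]. Names and defaults here
    # are illustrative assumptions, not the briertools package interface.
    import numpy as np

    def clipped_brier(y_true, y_prob, t_min=0.1, t_max=0.5):
        """Brier-style score restricted to thresholds in [t_min, t_max].

        Equivalent (up to an additive per-label constant) to clipping the
        forecasts into [t_min, t_max] and computing the squared error.
        """
        y_true = np.asarray(y_true, dtype=float)
        # Clip forecasts into the threshold range of interest.
        q = np.clip(np.asarray(y_prob, dtype=float), t_min, t_max)
        # Squared error of the clipped forecast ...
        raw = (q - y_true) ** 2
        # ... minus the best score attainable inside the range, so a
        # forecast at the correct end of the range scores exactly zero.
        best = np.where(y_true == 1, (1.0 - t_max) ** 2, t_min ** 2)
        return float(np.mean(raw - best))

    # Confidently correct forecasts score 0; errors are penalized only
    # insofar as they change decisions within the threshold range.
    print(clipped_brier([1, 0, 1], [0.9, 0.05, 0.2]))  # 0.13

One way to read the design: forecasts outside [t_min, t_max] all induce the same decision at every threshold in the range, so pushing a probability further past the boundary should not change the score, which is exactly what the clipping achieves.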
Submission Number: 222