Abstract: This paper argues that model comparison in machine learning can be much improved by using \emph{paired testing}, i.e.\ comparing the predictions of methods A and B on each (common) test example. Due to the limitations of null hypothesis significance testing in frequentist statistics, Bayesian methods are recommended, including the use of the region of practical equivalence (ROPE; Kruschke 2015a; Kruschke and Liddell 2018; Benavoli, Corani, Dem\v{s}ar, and Zaffalon 2017). We discuss a Bayesian $t$-test and a Bayesian McNemar test for
comparisons on a single task, and Bayesian hierarchical models for comparisons over multiple tasks. Two worked examples are presented to illustrate the methods, and the use of reporting guidelines is discussed as a potential means of changing current practice.
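To make the paired-testing idea concrete, the sketch below shows a minimal Bayesian paired $t$-test with a ROPE, in the spirit of the methods the abstract describes. It is an illustrative sketch, not the paper's implementation: the function name, the default ROPE half-width of 0.01, and the flat prior (under which the posterior of the mean difference is a Student-$t$ distribution) are assumptions made here for demonstration.

```python
import numpy as np
from scipy import stats

def bayesian_paired_ttest(scores_a, scores_b, rope=(-0.01, 0.01)):
    """Bayesian paired t-test with a region of practical equivalence (ROPE).

    Assumes a flat prior, under which the posterior of the mean paired
    difference mu is Student-t: t_{n-1}(d_bar, s_d / sqrt(n)).
    Returns the posterior probabilities that B - A lies below, within,
    or above the ROPE.
    """
    d = np.asarray(scores_b, float) - np.asarray(scores_a, float)
    n = d.size
    post = stats.t(df=n - 1, loc=d.mean(), scale=d.std(ddof=1) / np.sqrt(n))
    p_left = post.cdf(rope[0])            # B practically worse than A
    p_rope = post.cdf(rope[1]) - p_left   # A and B practically equivalent
    p_right = 1.0 - post.cdf(rope[1])     # B practically better than A
    return p_left, p_rope, p_right

# Hypothetical example: accuracies of methods A and B on the same 10 folds
rng = np.random.default_rng(0)
a = 0.80 + 0.01 * rng.standard_normal(10)
b = a + 0.02 + 0.01 * rng.standard_normal(10)
p_left, p_rope, p_right = bayesian_paired_ttest(a, b)
```

Because the comparison is paired (both methods are scored on the same test folds), the per-example differences cancel shared variance, which is exactly what motivates paired testing over comparing two independent averages.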
Submission Type: Long submission (more than 12 pages of main content)
Assigned Action Editor: ~Jes_Frellsen1
Submission Number: 8469