Abstract: This paper argues that model comparison in machine learning can be much improved by using \emph{paired testing}, i.e.\ comparing the predictions of methods A and B on each (common) test example. Due to the limitations of null hypothesis significance testing in frequentist statistics, Bayesian methods are recommended, including the use of the region of practical equivalence (ROPE; Kruschke 2015a; Kruschke and Liddell 2018; Benavoli, Corani, Dem\v{s}ar, and Zaffalon 2017). We discuss a Bayesian $t$-test and a Bayesian McNemar test for
comparisons on a single task, and Bayesian hierarchical models for comparisons over multiple tasks. Two worked examples are presented to illustrate the methods, and the use of reporting guidelines is discussed as a potential means of changing current practice.
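To make the paired-testing idea concrete, the sketch below shows a minimal Bayesian paired $t$-test with a ROPE, in the spirit of the methods the abstract describes. It is an illustrative sketch, not the paper's implementation: the function name, the default ROPE half-width of 0.01, and the flat prior (under which the posterior of the mean difference is a Student-$t$ distribution) are assumptions made here for demonstration.

```python
import numpy as np
from scipy import stats

def bayesian_paired_ttest(scores_a, scores_b, rope=(-0.01, 0.01)):
    """Bayesian paired t-test with a region of practical equivalence (ROPE).

    Assumes a flat prior, under which the posterior of the mean paired
    difference mu is Student-t: t_{n-1}(d_bar, s_d / sqrt(n)).
    Returns the posterior probabilities that B - A lies below, within,
    or above the ROPE.
    """
    d = np.asarray(scores_b, float) - np.asarray(scores_a, float)
    n = d.size
    post = stats.t(df=n - 1, loc=d.mean(), scale=d.std(ddof=1) / np.sqrt(n))
    p_left = post.cdf(rope[0])            # B practically worse than A
    p_rope = post.cdf(rope[1]) - p_left   # A and B practically equivalent
    p_right = 1.0 - post.cdf(rope[1])     # B practically better than A
    return p_left, p_rope, p_right

# Hypothetical example: accuracies of methods A and B on the same 10 folds
rng = np.random.default_rng(0)
a = 0.80 + 0.01 * rng.standard_normal(10)
b = a + 0.02 + 0.01 * rng.standard_normal(10)
p_left, p_rope, p_right = bayesian_paired_ttest(a, b)
```

Because the comparison is paired (both methods are scored on the same test folds), the per-example differences cancel shared variance, which is exactly what motivates paired testing over comparing two independent averages.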
Submission Type: Long submission (more than 12 pages of main content)
Assigned Action Editor: ~Jes_Frellsen1
Submission Number: 8469