Statistical Multicriteria Benchmarking via the GSD-Front

Published: 25 Sept 2024, Last Modified: 06 Nov 2024, NeurIPS 2024 spotlight, CC BY 4.0
Keywords: multicriteria benchmarking, robust statistics, statistical test, imprecise probabilities, reliability, non-standard scales of measurement, decision theory
TL;DR: We propose the GSD-front for reliable multicriteria benchmarking of classifiers, give conditions for its consistent estimability, propose (robust) statistical tests for checking whether a classifier is contained in it, and illustrate the approach on two benchmark suites.
Abstract: Given the vast number of classifiers that have been (and continue to be) proposed, reliable methods for comparing them are becoming increasingly important. The desire for reliability is broken down into three main aspects: (1) Comparisons should allow for different quality metrics simultaneously. (2) Comparisons should take into account the statistical uncertainty induced by the choice of benchmark suite. (3) The robustness of the comparisons under small deviations in the underlying assumptions should be verifiable. To address (1), we propose to compare classifiers using a generalized stochastic dominance (GSD) ordering and present the GSD-front as an information-efficient alternative to the classical Pareto-front. For (2), we propose a consistent statistical estimator for the GSD-front and construct a statistical test for whether a (potentially new) classifier lies in the GSD-front of a set of state-of-the-art classifiers. For (3), we relax our proposed test using techniques from robust statistics and imprecise probabilities. We illustrate our concepts on the benchmark suite PMLB and on the platform OpenML.
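
For context on the baseline the abstract contrasts against, the following is a minimal sketch of the classical Pareto-front over two quality metrics. It does not implement the paper's GSD ordering or its statistical tests; all classifier names and scores are hypothetical, and higher values are assumed to be better for both metrics.

    # Minimal sketch: classical Pareto-front of classifiers over two quality
    # metrics (e.g., accuracy and balanced accuracy). The GSD-front proposed
    # in the paper is presented as an information-efficient alternative to
    # this baseline; this snippet only illustrates the Pareto-front itself.
    # All names and numbers below are hypothetical.

    scores = {
        "clf_a": (0.91, 0.88),
        "clf_b": (0.89, 0.93),
        "clf_c": (0.85, 0.84),
        "clf_d": (0.91, 0.90),
    }

    def dominates(x, y):
        # x Pareto-dominates y: at least as good in every metric,
        # strictly better in at least one.
        return all(a >= b for a, b in zip(x, y)) and any(a > b for a, b in zip(x, y))

    pareto_front = {
        name for name, s in scores.items()
        if not any(dominates(other, s) for oname, other in scores.items() if oname != name)
    }
    print(pareto_front)  # {'clf_b', 'clf_d'}; 'clf_a' and 'clf_c' are dominated

The GSD-front replaces this componentwise dominance with a generalized stochastic dominance ordering over the benchmark datasets, which is then estimated and tested statistically as described in the abstract.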
Primary Area: Evaluation (methodology, meta studies, replicability and validity)
Submission Number: 3964