Hyperparameter search on the test set in the wild

13 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: EEG
TL;DR: We demonstrate that a number of recent papers on EEG classification overestimate classification accuracy by doing hyperparameter search on the test set.
Abstract: Systems neuroscience has rapidly adopted machine-learning techniques but has yet to develop a robust, standardized methodology for assessing the performance of decoding models. Methodological issues can sometimes be subtle, arising as a consequence of experimental design. Here, in contrast, we investigate the consequences of post-hoc model selection, an issue that is neither subtle nor idiosyncratic. It occurs when a single test set is used both to select hyperparameters and to evaluate performance, which favors models that overfit to ungeneralizable features of the test set. Although the problems with this practice are well documented in the ML literature, it remains in use in several domains, including systems neuroscience. To highlight this practice, we performed a series of experiments using a selection of models from affected EEG decoding studies and found that the overestimation of decoding accuracy was substantial, ranging from 0.74% to 1.24%. Moreover, we demonstrate that post-hoc model selection favors unstable model architectures, since greater variability in their performance increases the likelihood that some instance of the model will coincidentally match the test set. Comparisons of model performance under post-hoc model selection may thus mislead researchers into developing increasingly complex and unstable models that fail to outperform simpler, more stable ones.
Primary Area: applications to neuroscience & cognitive science
Submission Number: 4806
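The pitfall described in the abstract can be illustrated with a short, self-contained sketch. The snippet below is not from the paper and uses a synthetic dataset, a generic SVM, and a placeholder hyperparameter grid rather than the EEG models the authors studied; it only contrasts selecting a hyperparameter on a held-out validation set with the criticized practice of selecting it directly on the test set and reporting that same number.

```python
# Minimal sketch (illustrative only, not the paper's method or data):
# compare reported test accuracy under proper validation-set selection
# vs. post-hoc selection on the test set itself.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for a decoding dataset.
X, y = make_classification(n_samples=600, n_features=40, n_informative=8,
                           random_state=0)
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.5,
                                                    random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5,
                                                random_state=0)

# Placeholder hyperparameter grid (regularization strength C).
grid = [0.01, 0.1, 1.0, 10.0, 100.0]

def fit_and_score(C):
    clf = SVC(C=C, kernel="rbf").fit(X_train, y_train)
    return clf.score(X_val, y_val), clf.score(X_test, y_test)

scores = [fit_and_score(C) for C in grid]

# Post-hoc selection: pick C by its *test* accuracy and report that
# same number -- the practice the paper argues inflates results.
posthoc_acc = max(test_acc for _, test_acc in scores)

# Proper selection: pick C on the validation set, then report the test
# accuracy of that single chosen model.
best_idx = int(np.argmax([val_acc for val_acc, _ in scores]))
proper_acc = scores[best_idx][1]

print(f"test accuracy, post-hoc (test-set) selection: {posthoc_acc:.3f}")
print(f"test accuracy, validation-set selection:      {proper_acc:.3f}")
```

Because the post-hoc number is a maximum over several test-set evaluations, it is biased upward, and the bias grows with the size of the grid and the variability of the model, consistent with the abstract's point about unstable architectures.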