Confirmation: our paper adheres to reproducibility best practices. In particular, we confirm that all important details required to reproduce results are described in the paper, that the authors agree to the paper being made available online through OpenReview under a CC-BY 4.0 license (https://creativecommons.org/licenses/by/4.0/), and that the authors have read and commit to adhering to the AutoML 2025 Code of Conduct (https://2025.automl.cc/code-of-conduct/).
TL;DR: The paper defines and analyzes overtuning, a form of overfitting at the hyperparameter optimization level, showing that it occurs relatively frequently and, while often benign, can harm generalization and therefore warrants greater attention in AutoML.
Abstract: Hyperparameter optimization (HPO) aims to identify an optimal hyperparameter configuration (HPC) such that the resulting model generalizes well to unseen data.
Since directly optimizing the expected generalization error is impossible, resampling techniques like holdout validation or cross-validation are used as proxy measures in HPO.
However, this implicitly assumes that the HPC minimizing validation error will also yield the best true generalization performance.
Given that our inner validation error estimate is inherently stochastic and depends on the resampling, we study:
Can excessive optimization of the validation error lead to an effect similarly detrimental to that of excessive optimization of the empirical risk of an ML model?
This phenomenon, which we refer to as overtuning, represents a form of overfitting at the HPO level.
Despite its potential impact, overtuning has received limited attention in the HPO and automated machine learning (AutoML) literature.
We first formally define overtuning and distinguish it from related concepts such as meta-overfitting.
We then reanalyze large-scale HPO benchmark data, assessing how frequently overtuning occurs and its practical relevance.
Our findings suggest that overtuning is more common than expected, although often mild.
However, in 10% of cases, severe overtuning results in selecting an HPC whose generalization performance is worse than that of the default HPC.
We further examine how factors such as the chosen performance metric, resampling method, dataset size, learning algorithm, and optimization strategy influence overtuning and discuss potential mitigation strategies.
Our results highlight the need to raise awareness of overtuning, particularly in the small-data regime, indicating that further mitigation strategies should be studied.
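To make the phenomenon concrete, below is a minimal illustrative sketch (not from the paper) of random-search HPO with holdout validation on a synthetic dataset. The dataset, learner, search space, and budget are all illustrative assumptions; the point is only that the incumbent's validation error decreases monotonically by construction, while its test error need not.

```python
# Minimal sketch (illustrative assumptions only): random-search HPO with
# holdout validation, tracking the incumbent's validation and test error.
# If a later incumbent has higher test error than an earlier one despite
# lower validation error, the run exhibits overtuning in the sense above.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=300, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X_train, y_train, test_size=0.33, random_state=0)

best_val_err = np.inf
for t in range(100):  # HPO budget
    # Sample a hyperparameter configuration (HPC) uniformly at random.
    hpc = {"max_depth": int(rng.integers(1, 20)),
           "min_samples_leaf": int(rng.integers(1, 20))}
    model = DecisionTreeClassifier(random_state=0, **hpc).fit(X_tr, y_tr)
    val_err = 1 - model.score(X_val, y_val)    # proxy objective optimized by HPO
    test_err = 1 - model.score(X_test, y_test) # estimate of true generalization error
    if val_err < best_val_err:                 # new incumbent found
        best_val_err = val_err
        print(f"iter {t:3d}: val_err={val_err:.3f}  test_err={test_err:.3f}")
```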
Submission Number: 6