Abstract: While current machine learning hyperparameter tuning methods have been thoroughly tested and show consistently high performance on large datasets, few studies have rigorously assessed their performance in small data regimes. Studies that have examined hyperparameter optimization on small datasets have found reduced generalization performance, poor correlation between validation and test error, and overly optimistic error estimates for the chosen hyperparameters. These problems have been observed across different hyperparameter optimization methods, including grid search and Bayesian optimization. We apply design of experiments principles to mitigate the bias between validation error and generalizable test error during hyperparameter tuning. Specifically, we fit a surface to a space-filling design on the hyperparameter space and use it to generate optimal hyperparameter sets. Using fourteen publicly available datasets and repeated experiments via Monte Carlo simulation, we show that this method achieves generalizable test error similar to that of both grid search and Bayesian optimization, while the bias between the validation error and generalizable test error is reduced by 80-96% compared to both methods. As a secondary outcome of this work, we find that none of the evaluated hyperparameter optimization methods offers consistent improvement over untuned models in our experiments, raising questions about the general efficacy of hyperparameter optimization in small sample regimes.
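As a rough illustration of the idea of fitting a surface to a space-filling design, the sketch below samples a two-dimensional hyperparameter space, fits a surface to noisy validation errors, and takes the minimizer of the fitted surface rather than the single best observed point. The abstract does not specify the design or the surface, so the Latin hypercube design, the quadratic surface, the function names, and the `toy_validation_error` stand-in are all assumptions made for illustration, not the authors' implementation.

```python
import random


def latin_hypercube(n_points, n_dims, rng):
    """Space-filling design on [0, 1]^n_dims: one point per stratum in each dimension."""
    columns = []
    for _ in range(n_dims):
        strata = list(range(n_points))
        rng.shuffle(strata)
        columns.append([(s + rng.random()) / n_points for s in strata])
    return [tuple(col[i] for col in columns) for i in range(n_points)]


def quadratic_features(x):
    """Full second-order model in two hyperparameters: 1, a, b, a^2, b^2, ab."""
    a, b = x
    return [1.0, a, b, a * a, b * b, a * b]


def solve_linear_system(A, y):
    """Gaussian elimination with partial pivoting for a small dense system."""
    n = len(y)
    M = [row[:] + [y[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    coef = [0.0] * n
    for r in reversed(range(n)):
        s = sum(M[r][c] * coef[c] for c in range(r + 1, n))
        coef[r] = (M[r][n] - s) / M[r][r]
    return coef


def fit_surface(points, errors):
    """Least-squares fit via the normal equations (X^T X) beta = X^T y."""
    X = [quadratic_features(p) for p in points]
    k = len(X[0])
    XtX = [[sum(row[i] * row[j] for row in X) for j in range(k)] for i in range(k)]
    Xty = [sum(row[i] * e for row, e in zip(X, errors)) for i in range(k)]
    return solve_linear_system(XtX, Xty)


def predict(coef, x):
    """Evaluate the fitted surface at hyperparameter setting x."""
    return sum(c * f for c, f in zip(coef, quadratic_features(x)))


def toy_validation_error(x, rng):
    """Hypothetical stand-in for a noisy cross-validation error (not real data)."""
    a, b = x
    return (a - 0.3) ** 2 + (b - 0.7) ** 2 + rng.gauss(0.0, 0.01)


rng = random.Random(0)
design = latin_hypercube(30, 2, rng)                     # space-filling design
errors = [toy_validation_error(p, rng) for p in design]  # noisy validation errors
coef = fit_surface(design, errors)                       # smooth surface over the space

# Choose the hyperparameters that minimize the fitted surface on a dense grid,
# instead of the single design point with the lowest (noisy) observed error.
grid = [(i / 50, j / 50) for i in range(51) for j in range(51)]
best = min(grid, key=lambda x: predict(coef, x))
print(best)
```

Because the surface averages over noise at all design points, its minimum is less prone to chasing one lucky validation score, which is one plausible reading of why such an approach could reduce the gap between validation and test error.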
Submission Type: Long submission (more than 12 pages of main content)
Assigned Action Editor: ~Shangqian_Gao1
Submission Number: 7811