Keywords: Regression-Stratified Sampling, Tabular AutoML, Algorithm Selection, Time-Constrained AutoML, Probability Density Function (PDF)
TL;DR: We introduce a Regression-Stratified Sampling method with a PDF Energy metric for selecting optimized machine learning algorithms in AutoML for tabular data, outperforming existing techniques across diverse datasets and AutoML tools.
Abstract: Selecting a machine-learning (ML) algorithm is an essential step in tabular AutoML training. Finding an optimized algorithm in a search space can be expensive for large tabular datasets, especially under time constraints. In this study, we introduce a novel Regression-Stratified Sampling approach that optimizes algorithm selection by minimizing the distance between the distribution of the target variable(s) in a data subset and in the full-scale dataset, measured via their Probability Density Functions (PDFs). Additionally, we introduce a PDF Energy metric, based on relative entropy, to identify an optimized ML algorithm from the search space.
Our comprehensive evaluation results demonstrate that the proposed approach successfully selects optimized algorithms from a search space of atomic and ensemble models, outperforming simple random sampling methods. We also conduct a thorough evaluation against Kullback-Leibler (KL) divergence, where the PDF Energy metric proves superior in algorithm selection.
Furthermore, we validate our approach for ML algorithm selection in an end-to-end scenario across 31 public datasets using 6 tabular AutoML tools. The empirical results indicate that our proposed method efficiently utilizes Regression-Stratified Sampling and reliably identifies an optimized machine learning algorithm for tabular data through the PDF Energy metric under time constraints.
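As a rough illustration of the sampling idea described in the abstract, the sketch below stratifies a continuous target into quantile bins, draws the same fraction from each bin, and then compares the subset's target distribution with the full dataset's using a histogram-based KL divergence. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the quantile binning, the bin counts, and the smoothing constant are all hypothetical choices, and because the abstract does not define the PDF Energy metric, plain KL divergence stands in for it here.

```python
import numpy as np


def stratified_sample_by_target(y, frac, n_bins=10, seed=0):
    """Return sorted indices of a subset drawn evenly from quantile bins of y.

    Hypothetical helper: quantile binning is an assumed stratification
    scheme, not one taken from the paper.
    """
    rng = np.random.default_rng(seed)
    # Quantile edges give strata of roughly equal size over the target.
    edges = np.quantile(y, np.linspace(0.0, 1.0, n_bins + 1))
    # Interior edges only, so stratum ids fall in 0 .. n_bins - 1.
    strata = np.digitize(y, edges[1:-1])
    picked = []
    for s in np.unique(strata):
        members = np.flatnonzero(strata == s)
        # Keep at least one row per non-empty stratum.
        k = max(1, int(round(frac * members.size)))
        picked.append(rng.choice(members, size=k, replace=False))
    return np.sort(np.concatenate(picked))


def kl_divergence_hist(y_full, y_sub, n_bins=30, eps=1e-12):
    """KL(P_full || P_sub) between histogram PDF estimates on shared bins.

    Stand-in for the PDF Energy metric, which the abstract does not define.
    """
    edges = np.histogram_bin_edges(y_full, bins=n_bins)
    p, _ = np.histogram(y_full, bins=edges)
    q, _ = np.histogram(y_sub, bins=edges)
    # Normalize to probabilities and smooth so empty bins do not yield log(0).
    p = p / p.sum() + eps
    q = q / q.sum() + eps
    p = p / p.sum()
    q = q / q.sum()
    return float(np.sum(p * np.log(p / q)))


# Usage on synthetic, skewed target values (illustrative data only).
demo_rng = np.random.default_rng(42)
y = demo_rng.lognormal(mean=0.0, sigma=1.0, size=5000)

strat_idx = stratified_sample_by_target(y, frac=0.1)
rand_idx = demo_rng.choice(y.size, size=strat_idx.size, replace=False)

print("KL, stratified subset:", kl_divergence_hist(y, y[strat_idx]))
print("KL, random subset:    ", kl_divergence_hist(y, y[rand_idx]))
```

A lower divergence means the subset's target distribution is closer to that of the full dataset, which is the property the sampling step aims for before candidate algorithms are trained on the subset.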
Submission Number: 103