CART-Based Synthetic Tabular Data Generation for Imbalanced Regression

ICLR 2026 Conference Submission 22132 Authors

20 Sept 2025 (modified: 08 Oct 2025), ICLR 2026 Conference Submission, Readers: Everyone, License: CC BY 4.0
Keywords: Imbalanced Regression, Data-level Methods, Sampling, Synthetic Data Generation, Extreme Value Prediction
Abstract: Handling imbalanced target distributions in regression remains a significant challenge for tabular data, where underrepresentation of relevant regions of the target space can hinder model performance. Among data-level solutions, approaches such as random sampling and SMOTE-based methods adapt classification techniques to regression. However, these methods typically rely on crisp, artificial thresholds over the target variable, a limitation inherited from classification that introduces arbitrariness and often leads to non-intuitive and potentially misleading problem formulations. Recent generative models, such as GANs and VAEs, provide flexible sample synthesis but come with high computational costs and limited interpretability. In this study, we adapt an existing CART-based synthetic data generation method to imbalanced regression. The new method integrates relevance- and density-based mechanisms to guide sampling toward sparse regions of the target space and employs a threshold-free, feature-driven generation process, making it suitable for heterogeneous data types, complex inter-correlations, and intricate column-wise distributions. Our experimental study focuses on the prediction of extreme target values across benchmark datasets. The results indicate that the proposed method is competitive with other resampling and generative strategies in predictive performance while offering faster execution and greater transparency, providing the best trade-off between these aspects. These results highlight the method’s potential as a transparent, scalable data-level strategy for improving regression models in imbalanced domains.
Supplementary Material: zip
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Submission Number: 22132
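Illustrative sketch: the abstract describes density-guided sampling of sparse target regions combined with CART-based generation, but gives no implementation details. The Python below is one possible reading, not the authors' method. It assumes a sequential CART synthesizer that draws each column from the leaf of a tree fitted on the previously generated columns, and it uses a simple histogram-based inverse-density weight as a stand-in for the relevance and density mechanisms. The names `rarity_weights` and `cart_synthesize`, the bin count, and the leaf size are hypothetical choices for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor


def rarity_weights(y, bins=20):
    """Inverse-density sampling weights (assumed stand-in for the paper's
    relevance/density mechanism): targets in sparse bins get more weight."""
    counts, edges = np.histogram(y, bins=bins)
    # Map each target value to its histogram bin using the inner edges.
    idx = np.clip(np.digitize(y, edges[1:-1]), 0, bins - 1)
    w = 1.0 / np.maximum(counts[idx], 1)
    return w / w.sum()


def cart_synthesize(data, n_new, target_col=-1, min_leaf=5, seed=0):
    """Sequential CART synthesis sketch: no threshold is placed on the target;
    rarity only enters through continuous sampling weights."""
    rng = np.random.default_rng(seed)
    n, d = data.shape
    t = target_col % d
    y = data[:, t]

    # Step 1: draw target values, favouring sparse regions of the target space.
    synth = np.empty((n_new, d))
    synth[:, t] = y[rng.choice(n, size=n_new, p=rarity_weights(y))]

    # Step 2: generate each feature conditionally on the columns already done.
    done = [t]
    for j in (c for c in range(d) if c != t):
        tree = DecisionTreeRegressor(min_samples_leaf=min_leaf, random_state=seed)
        tree.fit(data[:, done], data[:, j])
        real_leaf = tree.apply(data[:, done])    # leaf id of every real row
        synth_leaf = tree.apply(synth[:, done])  # leaf id of every synthetic row
        for leaf in np.unique(synth_leaf):
            donors = data[real_leaf == leaf, j]  # observed values in that leaf
            rows = np.where(synth_leaf == leaf)[0]
            synth[rows, j] = rng.choice(donors, size=rows.size)
        done.append(j)
    return synth


# Usage on toy data with a right-skewed target in the last column.
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 2))
y = np.exp(X[:, 0]) + 0.1 * rng.normal(size=500)
synthetic = cart_synthesize(np.column_stack([X, y]), n_new=300)
```

Because every synthetic value is drawn from observed values in a tree leaf, the sketch handles arbitrary column-wise distributions without a parametric model. Categorical columns would need a classifier tree, which is omitted here.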