Generalized Oversampling for Learning from Imbalanced datasets and Associated Theory: Application in Regression

Published: 19 Jun 2024, Last Modified: 19 Jun 2024, Accepted by TMLR
Abstract: In supervised learning, real-world datasets are frequently imbalanced, which makes learning difficult for standard algorithms. Research on imbalanced learning has mainly focused on classification tasks; despite its importance, imbalanced regression has very few solutions. In this paper, we propose a data augmentation procedure, the GOLIATH algorithm, based on kernel density estimates and designed specifically for imbalanced data. This general approach encompasses two large families of synthetic oversampling methods: those based on perturbations, such as Gaussian noise, and those based on interpolations, such as SMOTE. It also provides an explicit form for these algorithms, from which new synthetic data generators are derived. We apply GOLIATH to imbalanced regression by combining these generators with a new wild-bootstrap resampling technique for the target values. We evaluate the GOLIATH algorithm on imbalanced regression and compare it with state-of-the-art techniques.
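The two families of synthetic oversampling named in the abstract can be illustrated with a minimal sketch. This is not the GOLIATH algorithm itself: the function names, the isotropic `bandwidth` parameter, and the brute-force neighbour search are assumptions made for illustration only. The perturbation generator adds Gaussian noise to a randomly chosen seed point, which amounts to sampling from a Gaussian kernel density estimate; the interpolation generator draws a point on the segment between a seed and one of its nearest neighbours, in the style of SMOTE.

```python
import numpy as np


def gaussian_noise_oversample(X, n_new, bandwidth, rng):
    """Perturbation-based generator (illustrative, hypothetical name).

    Picks seed rows uniformly at random and adds isotropic Gaussian noise
    with standard deviation `bandwidth` to each of them.
    """
    seeds = rng.integers(0, len(X), size=n_new)
    noise = bandwidth * rng.standard_normal((n_new, X.shape[1]))
    return X[seeds] + noise


def smote_like_oversample(X, n_new, k, rng):
    """Interpolation-based generator in the style of SMOTE (illustrative).

    Picks a seed row, picks one of its k nearest neighbours, and returns a
    point drawn uniformly on the segment joining them. Requires k < len(X).
    """
    # Brute-force pairwise Euclidean distances; fine for small samples.
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)  # a point is not its own neighbour
    neighbours = np.argsort(dist, axis=1)[:, :k]

    seeds = rng.integers(0, len(X), size=n_new)
    picks = neighbours[seeds, rng.integers(0, k, size=n_new)]
    lam = rng.random((n_new, 1))  # interpolation weight in [0, 1)
    return X[seeds] + lam * (X[picks] - X[seeds])


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X_rare = rng.normal(size=(20, 3))  # stand-in for under-represented samples
    X_noise = gaussian_noise_oversample(X_rare, n_new=50, bandwidth=0.1, rng=rng)
    X_interp = smote_like_oversample(X_rare, n_new=50, k=5, rng=rng)
    print(X_noise.shape, X_interp.shape)
```

Both functions generate features only. In a regression setting the target values of the synthetic points must also be produced, which the abstract handles with a wild-bootstrap resampling technique that is not sketched here.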
Submission Length: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission: Addressed the reviewers' comments and added the "Impact analysis of parameters" parts.
Supplementary Material: pdf
Assigned Action Editor: ~Gintare_Karolina_Dziugaite1
Submission Number: 2290