Oversampling Tabular Data with Deep Generative Models: Is it worth the effort?

Published: 09 Dec 2020, Last Modified: 05 May 2023, ICBINB 2020 Spotlight
Keywords: oversampling, class imbalance, deep generative models, machine learning, deep learning
TL;DR: Is it worth using deep generative models to oversample tabular data?
Abstract: In practice, machine learning experts are often confronted with imbalanced data. Without accounting for the imbalance, common classifiers perform poorly, and standard evaluation metrics mislead practitioners about the model's actual performance. Standard methods for treating imbalanced datasets are undersampling and oversampling: samples are removed from the majority class, or synthetic samples are added to the minority class. In this paper, we follow up on recent developments in deep learning: we take proposed deep generative models and study their ability to provide realistic samples that improve performance on imbalanced classification tasks via oversampling. Across 160K+ experiments, we show that the improvements in performance metrics, while statistically significant when the methods are ranked as in the literature, are often minor in absolute terms, especially relative to the effort required. Furthermore, we observe that a large part of the improvement stems from undersampling, not oversampling.
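The abstract contrasts undersampling and oversampling as remedies for class imbalance. The sketch below is not from the paper; it illustrates the kind of comparison the study runs at scale, substituting the classical SMOTE oversampler for a deep generative model and assuming scikit-learn and imbalanced-learn are installed. The dataset, classifier, and metric are illustrative choices, not the paper's experimental setup.

```python
# Illustrative sketch (not the paper's pipeline): compare no resampling,
# random undersampling, and SMOTE oversampling on synthetic imbalanced
# tabular data.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler

# Synthetic tabular data with roughly a 9:1 majority/minority ratio.
X, y = make_classification(n_samples=5000, n_features=20,
                           weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0)

resamplers = {
    "no resampling": None,
    "undersampling": RandomUnderSampler(random_state=0),
    "oversampling (SMOTE)": SMOTE(random_state=0),
}
for name, sampler in resamplers.items():
    if sampler is None:
        X_res, y_res = X_train, y_train
    else:
        # Resampling is applied to the training split only; the test
        # split keeps the original class imbalance.
        X_res, y_res = sampler.fit_resample(X_train, y_train)
    clf = RandomForestClassifier(random_state=0).fit(X_res, y_res)
    score = balanced_accuracy_score(y_test, clf.predict(X_test))
    print(f"{name}: balanced accuracy = {score:.3f}")
```

Comparing the undersampling-only run against the oversampling run mirrors the paper's point: any gain attributable to synthetic minority samples should be measured against the simpler baseline of removing majority samples.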