Data Augmentation's Effect on Machine Learning Models when Learning with Imbalanced Data

Damien A. Dablain, Nitesh V. Chawla

Published: 2024, Last Modified: 20 Jul 2025, Venue: DSAA 2024, License: CC BY-SA 4.0

Abstract: Real-world data is often imbalanced, such that the number of training instances varies by class. Data augmentation (DA) of under-represented classes is commonly used to improve model generalization in the face of class imbalance. Despite its ubiquity, the impact of DA on machine learning (ML) models is not clearly understood. Here, we undertake a holistic examination of the effect of DA on under-represented classes. Unlike other studies, which focus on a single ML model type, we examine three classifier families (convolutional neural networks, support vector machines, and logistic regression models), five DA techniques, and two data modalities (image and tabular). Our research indicates that DA, when applied to imbalanced data, produces substantial changes in model weights, support vectors, and front-end feature selection. These changes occur with respect to all classes, not just the ones to which DA is applied. Further, our empirical analysis shows that DA's positive influence on generalization does not necessarily result from reducing weight norms. Rather, weight and support vector specialization play important roles in generalization. The specialization process may be a form of memorization that is spawned by the variance introduced by augmented data. We investigate the seeming contradiction between improved generalization and weight and support vector specialization.
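
To make the kind of comparison described in the abstract concrete, the following is a minimal sketch, not the authors' experimental setup. It assumes a synthetic two-class tabular dataset, uses random oversampling as a stand-in for DA (the abstract does not name the five techniques studied), and fits a small logistic regression by gradient descent so that the weight norm can be inspected before and after augmentation. All function names (`make_imbalanced`, `random_oversample`, `fit_logreg`, `l2_norm`) are hypothetical and chosen only for illustration.

```python
import math
import random

def make_imbalanced(n_major=200, n_minor=20, seed=0):
    """Two Gaussian blobs in 2-D; class 1 is the under-represented class."""
    rng = random.Random(seed)
    X, y = [], []
    for _ in range(n_major):
        X.append([rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)])
        y.append(0)
    for _ in range(n_minor):
        X.append([rng.gauss(2.0, 1.0), rng.gauss(2.0, 1.0)])
        y.append(1)
    return X, y

def random_oversample(X, y, minority=1, seed=0):
    """Duplicate minority rows at random until both classes are the same size.
    Assumption: a simple stand-in for DA, not a method taken from the paper."""
    rng = random.Random(seed)
    minor_idx = [i for i, label in enumerate(y) if label == minority]
    n_needed = (len(y) - len(minor_idx)) - len(minor_idx)
    extra = [rng.choice(minor_idx) for _ in range(max(0, n_needed))]
    X_aug = X + [list(X[i]) for i in extra]
    y_aug = y + [minority] * len(extra)
    return X_aug, y_aug

def fit_logreg(X, y, lr=0.1, epochs=300):
    """Batch gradient descent on the mean log-loss; returns (weights, bias)."""
    d, n = len(X[0]), len(X)
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            z = sum(wj * xj for wj, xj in zip(w, xi)) + b
            p = 1.0 / (1.0 + math.exp(-z))  # predicted P(class 1)
            err = p - yi
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def l2_norm(w):
    """Euclidean norm of a weight vector."""
    return math.sqrt(sum(wj * wj for wj in w))

X, y = make_imbalanced()
X_aug, y_aug = random_oversample(X, y)
w_base, b_base = fit_logreg(X, y)
w_aug, b_aug = fit_logreg(X_aug, y_aug)
print("weight norm, imbalanced :", round(l2_norm(w_base), 3))
print("weight norm, oversampled:", round(l2_norm(w_aug), 3))
```

The two printed norms can then be compared alongside per-class accuracy on held-out data. The sketch only shows how such a measurement might be set up and makes no claim about which norm is larger.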