AutoGenDA: Automated Generative Data Augmentation for Imbalanced Classifications

TMLR Paper5038 Authors

05 Jun 2025 (modified: 08 Jul 2025) | Under review for TMLR | Readers: Everyone | CC BY 4.0
Abstract: Data augmentation increases the size of a deep learning training set by adding synthetic data. Recent advances in image generative models make it possible to synthesize high-quality images for this purpose. However, real-world datasets commonly follow an imbalanced class distribution, in which some classes have fewer samples than others. Image generation models may therefore struggle to synthesize diverse images for the less common classes, whose few samples lack richness and diversity. To address this, we introduce AutoGenDA, an automated generative data augmentation method that extracts label-invariant changes and transfers them across data classes through image captions and text-guided generative models. We also propose an automated search strategy that optimizes the data augmentation process for each data class, leading to better generalization. Our experiments demonstrate the effectiveness of AutoGenDA on various object classification datasets: it improves on standard data augmentation baselines by up to 4.9% on Pascal VOC, Caltech101, MS-COCO, and LVIS under multiple imbalanced classification settings.
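The abstract does not say how label-invariant changes are moved between classes. One possible reading, offered here only as an illustrative assumption and not as the authors' method, is that the context in a caption from a well-represented class (background, lighting, pose) is reused for a rare class by swapping the class name, and the resulting caption is then given as a prompt to a text-guided image generator. The minimal sketch below covers only the caption-rewriting step. The function names `transfer_caption` and `build_prompts` and the example captions are hypothetical, and no generative model is called.

```python
import re


def transfer_caption(caption: str, source_class: str, target_class: str) -> str:
    """Replace the source class name in a caption with the target class name.

    Hypothetical helper: everything in the caption other than the class
    name is treated as label-invariant context and left unchanged.
    """
    # Match the class name as a whole word, ignoring case.
    pattern = re.compile(rf"\b{re.escape(source_class)}\b", flags=re.IGNORECASE)
    return pattern.sub(target_class, caption)


def build_prompts(captions_by_class: dict, target_class: str) -> list:
    """Collect prompts for target_class from captions of all other classes."""
    prompts = []
    for source_class, captions in captions_by_class.items():
        if source_class == target_class:
            continue
        for caption in captions:
            new_caption = transfer_caption(caption, source_class, target_class)
            # Keep only captions in which the class name was actually found.
            if new_caption != caption:
                prompts.append(new_caption)
    return prompts


# Made-up captions: "dog" stands in for a common class, "cat" for a rare one.
captions_by_class = {
    "dog": ["a dog running on a snowy beach", "a photo of a dog at night"],
    "cat": ["a cat sitting on a sofa"],
}
prompts = build_prompts(captions_by_class, "cat")
```

In this sketch, `prompts` holds the rewritten captions for the rare class, which a text-guided generative model could then take as input. How many such images to generate per class is the kind of choice the abstract assigns to the automated search strategy, which this sketch does not attempt to model.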
Submission Length: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Jaesik_Park3
Submission Number: 5038