Diversify, Don't Fine-Tune: Scaling Up Visual Recognition Training with Synthetic Images

TMLR Paper 3017 Authors

17 Jul 2024 (modified: 28 Oct 2024) · Under review for TMLR · CC BY 4.0
Abstract: Recent advances in generative deep learning have enabled the creation of high-quality synthetic images via text-to-image generation. Prior research indicates that fine-tuning a pretrained diffusion model on ImageNet and generating synthetic training images can boost an ImageNet classifier's performance; however, once synthetic images begin to outnumber real ones in training, classifier performance starts to degrade, underscoring the scalability challenge of training with synthetic data. In this paper, we examine whether generative fine-tuning is necessary for achieving recognition performance improvements and investigate the scalability of training with large-scale synthetic images. We find that leveraging off-the-shelf generative models without fine-tuning, combined with measures that address class name ambiguity, limited prompt diversity, and domain shift, effectively mitigates the performance degradation caused by large-scale synthetic data. Specifically, we leverage large language models (LLMs) and CLIP to resolve class name ambiguity. To diversify images, we propose contextualized diversification (CD) and stylized diversification (SD), both driven by LLM-generated prompts. Finally, to mitigate domain shift, we apply domain adaptation techniques with auxiliary batch normalization for synthetic images. Our framework consistently boosts recognition model performance as synthetic data increases, even up to 6 times the original ImageNet size. Models trained with our approach demonstrate significant in-domain improvement on ImageNet-val (1.20\% to 2.35\% across various architectures) and strong out-of-domain generalization on ImageNet-Sketch and -Rendition ($\sim$10\% improvement with large vision transformers).
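The auxiliary batch normalization idea mentioned in the abstract can be illustrated with a minimal sketch: shared affine parameters are learned jointly, but real and synthetic batches each maintain their own normalization statistics so synthetic images do not skew the statistics applied to real images. This is a hypothetical NumPy illustration of the general technique, not the paper's implementation; all names (`AuxiliaryBatchNorm`, the `domain` argument) are assumptions.

```python
import numpy as np

class AuxiliaryBatchNorm:
    """Batch norm with per-domain running statistics (illustrative sketch).

    gamma/beta are shared across domains; running mean/variance are kept
    separately for "real" and "synthetic" batches, a common auxiliary-BN
    setup for training on mixed real and generated data.
    """

    def __init__(self, num_features, momentum=0.1, eps=1e-5):
        self.gamma = np.ones(num_features)   # shared scale
        self.beta = np.zeros(num_features)   # shared shift
        self.momentum = momentum
        self.eps = eps
        # One set of running statistics per domain.
        self.running_mean = {"real": np.zeros(num_features),
                             "synthetic": np.zeros(num_features)}
        self.running_var = {"real": np.ones(num_features),
                            "synthetic": np.ones(num_features)}

    def __call__(self, x, domain="real", training=True):
        if training:
            mean, var = x.mean(axis=0), x.var(axis=0)
            # Update only this domain's running statistics.
            self.running_mean[domain] = ((1 - self.momentum) * self.running_mean[domain]
                                         + self.momentum * mean)
            self.running_var[domain] = ((1 - self.momentum) * self.running_var[domain]
                                        + self.momentum * var)
        else:
            mean = self.running_mean[domain]
            var = self.running_var[domain]
        x_hat = (x - mean) / np.sqrt(var + self.eps)
        return self.gamma * x_hat + self.beta
```

At inference, only the "real" statistics would be used, so the synthetic branch influences training solely through the shared weights of the backbone and the shared affine parameters.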
Submission Length: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission: We revised our manuscript based on the reviewers' feedback: 1. We reworded the statement on the scalability experiment, specifying that "outnumbering" refers to the total quantity of unique synthetic images exceeding that of real ones, while balanced sampling is applied within each training batch. 2. We added an illustrative figure for the label ambiguity resolution module.
Assigned Action Editor: ~Jia-Bin_Huang1
Submission Number: 3017