Diversify, Don't Fine-Tune: Scaling Up Visual Recognition Training with Synthetic Images

TMLR Paper 3017 Authors

17 Jul 2024 (modified: 28 Oct 2024) · Under review for TMLR · CC BY 4.0
Abstract: Recent advances in generative deep learning have enabled the creation of high-quality synthetic images via text-to-image generation. Prior research indicates that fine-tuning a pretrained diffusion model on ImageNet and generating synthetic training images can boost an ImageNet classifier's performance; however, once synthetic images begin to outnumber real ones in training, classifier performance starts to degrade, underscoring the scalability challenge of training with synthetic data. In this paper, we examine whether generative fine-tuning is necessary for achieving recognition performance improvements and investigate the scalability of training with large-scale synthetic images. We find that leveraging off-the-shelf generative models without fine-tuning, combined with measures that address class name ambiguity, limited prompt diversity, and domain shift, effectively mitigates the performance degradation caused by large-scale synthetic data. Specifically, we leverage large language models (LLMs) and CLIP to resolve class name ambiguity. To diversify images, we propose contextualized diversification (CD) and stylized diversification (SD), both driven by LLM-generated prompts. Finally, to mitigate domain shift, we apply domain adaptation techniques with auxiliary batch normalization for synthetic images. Our framework consistently boosts recognition model performance as synthetic data increases, even up to 6 times the original ImageNet size. Models trained with our approach demonstrate significant in-domain improvement on ImageNet-val (1.20\% to 2.35\% across various architectures) and strong out-of-domain generalization on ImageNet-Sketch and -Rendition ($\sim$10\% improvement with large vision transformers).
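The auxiliary batch normalization idea mentioned in the abstract can be illustrated with a minimal sketch: shared affine parameters are learned jointly, but real and synthetic batches each maintain their own normalization statistics so synthetic images do not skew the statistics applied to real images. This is a hypothetical NumPy illustration of the general technique, not the paper's implementation; all names (`AuxiliaryBatchNorm`, the `domain` argument) are assumptions.

```python
import numpy as np

class AuxiliaryBatchNorm:
    """Batch norm with per-domain running statistics (illustrative sketch).

    gamma/beta are shared across domains; running mean/variance are kept
    separately for "real" and "synthetic" batches, a common auxiliary-BN
    setup for training on mixed real and generated data.
    """

    def __init__(self, num_features, momentum=0.1, eps=1e-5):
        self.gamma = np.ones(num_features)   # shared scale
        self.beta = np.zeros(num_features)   # shared shift
        self.momentum = momentum
        self.eps = eps
        # One set of running statistics per domain.
        self.running_mean = {"real": np.zeros(num_features),
                             "synthetic": np.zeros(num_features)}
        self.running_var = {"real": np.ones(num_features),
                            "synthetic": np.ones(num_features)}

    def __call__(self, x, domain="real", training=True):
        if training:
            mean, var = x.mean(axis=0), x.var(axis=0)
            # Update only this domain's running statistics.
            self.running_mean[domain] = ((1 - self.momentum) * self.running_mean[domain]
                                         + self.momentum * mean)
            self.running_var[domain] = ((1 - self.momentum) * self.running_var[domain]
                                        + self.momentum * var)
        else:
            mean = self.running_mean[domain]
            var = self.running_var[domain]
        x_hat = (x - mean) / np.sqrt(var + self.eps)
        return self.gamma * x_hat + self.beta
```

At inference, only the "real" statistics would be used, so the synthetic branch influences training solely through the shared weights of the backbone and the shared affine parameters.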
Submission Length: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission: We revised our manuscript based on the reviewers' feedback: 1. We reworded the statement on the scalability experiment, specifying that "outnumbering" refers to the total quantity of unique synthetic images exceeding that of real ones, while balanced sampling is applied within each training batch. 2. We added an illustrative figure for the label ambiguity resolution module.
Assigned Action Editor: ~Jia-Bin_Huang1
Submission Number: 3017