Keywords: generative model, representation learning, synthetic data
Abstract: Foundation models have achieved significant advances across various domains, yet their training demands vast amounts of real-world data, which is becoming increasingly scarce. To address this challenge, synthetic data has garnered substantial interest as an alternative for augmenting training datasets in fields such as computer vision and natural language processing. However, skepticism remains regarding whether classifiers trained on synthetic data can match the performance of those trained on real data. In this paper, we investigate this question through a detailed analysis of visual tasks, comparing classifiers trained on synthetic versus real data using CLIP and ViT. Our results reveal that synthetic classifiers, despite achieving overall accuracy comparable to their real-data-trained counterparts, exhibit deficiencies in a range of challenging real-world scenarios, such as fine-grained classification, extreme object scales, and extreme brightness. We find that these deficiencies can be traced to the failure of current generative models to capture the complexity and diversity of real-world data in these respects. To mitigate these issues efficiently, we explore \textbf{RealTune}, a simple method that enhances synthetic classifiers by finetuning them on a small amount of real data. Experimental evaluations demonstrate that RealTune significantly improves the performance of synthetic classifiers using only a limited real dataset (e.g., 40k images, 3% of ImageNet) and minimal training time (e.g., 1 hour on a single NVIDIA RTX 3090 GPU). Our findings indicate that while synthetic data is a valuable resource, integrating real and synthetic data is essential to achieve robust and efficient classifiers. This work underscores the necessity of leveraging both data types to bridge the performance gap and enhance the overall effectiveness of foundation models.
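The abstract describes RealTune as finetuning a synthetic-data-trained classifier on a small real subset. Below is a minimal, hypothetical sketch of that idea in PyTorch; the dataset path, checkpoint name, and hyperparameters are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical RealTune-style finetuning sketch (not the authors' code):
# start from a classifier assumed to be pretrained on synthetic data and
# finetune it briefly on a small real-image subset (e.g., ~3% of ImageNet).
import torch
from torch import nn
from torch.utils.data import DataLoader
from torchvision import datasets, models, transforms

device = "cuda" if torch.cuda.is_available() else "cpu"

# Standard ImageNet-style preprocessing for ViT-B/16.
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

# Assumed folder holding the small real subset (~40k ImageNet images).
real_subset = datasets.ImageFolder("data/imagenet_3pct", transform=preprocess)
loader = DataLoader(real_subset, batch_size=128, shuffle=True, num_workers=4)

# Stand-in for a synthetic-data-trained classifier: a ViT-B/16 whose weights
# are assumed to come from synthetic pretraining (checkpoint name is made up).
model = models.vit_b_16(weights=None, num_classes=1000).to(device)
# model.load_state_dict(torch.load("synthetic_pretrained_vit.pt"))

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5, weight_decay=0.05)
criterion = nn.CrossEntropyLoss()

model.train()
for epoch in range(1):  # a single short pass, consistent with the ~1 GPU-hour budget
    for images, labels in loader:
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
```

The key design point conveyed by the abstract is that the real data only nudges an already-trained model, so a low learning rate and very short schedule suffice; the sketch reflects that, but the exact settings would come from the paper itself.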
Primary Area: generative models
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 8604