Keywords: analysis by synthesis, image classification, data generation
TL;DR: This paper investigates the effectiveness of generative data augmentation in image classification, revealing that both internal and external data can significantly improve performance, with empirical guidelines established for using synthetic data.
Abstract: In this paper, we address a key question in machine learning: **How effectively can generative data augmentation enhance image classification?** We begin by examining the differences and similarities between real data and synthetic data generated by advanced text-to-image models. Through comprehensive experiments, we provide systematic insights into leveraging synthetic data for improved classification performance. Our findings show that: (1) generative data augmentation by models trained solely on the internal (available training) set can effectively improve classification performance, validating the long-held hypothesis that synthesis enhances analysis by enriching modeling capability; (2) for generative data augmentation by models trained separately on internal and external data (e.g., large-scale image-text pairs), the size of the equivalent synthetic augmentation set can be determined empirically. Beyond confirming the common intuition that real data augmentation is always preferred, our empirical formulation provides a guideline for quantitatively estimating how much larger the generative augmentation set must be than the real one to achieve comparable improvements. Our CIFAR-10 and ImageNet results also demonstrate how this factor depends on the size of the baseline training set and the quality of the generative model.
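The equivalent-size guideline described in the abstract can be sketched as a simple helper. This is an illustrative assumption, not the paper's actual formulation: it posits a linear "exchange rate" (synthetic images worth one real image), and the function name and interface are hypothetical.

```python
def equivalent_synthetic_size(real_aug_size: int, exchange_rate: float) -> int:
    """Estimate how many synthetic images are needed to match the classification
    improvement obtained from `real_aug_size` additional real images.

    `exchange_rate` (>= 1) is an empirically measured number of synthetic images
    equivalent to one real image; per the abstract, it would depend on the
    baseline training-set size and the generative model's quality.
    """
    if exchange_rate < 1:
        # Consistent with the intuition that real data augmentation is preferred:
        # a synthetic image is assumed to be worth at most one real image.
        raise ValueError("exchange_rate must be >= 1")
    return round(real_aug_size * exchange_rate)


# Hypothetical usage: if 5 synthetic images were empirically found to be worth
# 1 real image, matching a 10,000-image real augmentation would need ~50,000
# synthetic images.
print(equivalent_synthetic_size(10_000, 5.0))  # 50000
```

The linear form is only one plausible shape for such a guideline; the paper's empirical formulation may differ.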
Primary Area: applications to computer vision, audio, language, and other modalities
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 9246