Keywords: model collapse, model-data feedback loops, synthetic data, sampling bias, deep generative models, model misbehavior
TL;DR: We clarify and unify the fractured literature on the perils and promises of synthetic data in model-data feedback loops.
Abstract: The increasing presence of AI-generated content on the internet raises a critical question: What happens when generative machine learning models are pretrained on web-scale datasets containing data created by earlier models?
Some authors prophesy _model collapse_ under a '_replace_' scenario: a sequence of models is trained, the first on real data and each later one _only on_ synthetic data from its predecessor. In this scenario, models successively degrade. Others see collapse as avoidable: in an '_accumulate_' scenario, each model in the sequence is trained on all real and synthetic data generated so far. In this work, we deepen and extend the study of these contrasting scenarios. First, we compare the replace and accumulate scenarios in each of three prominent generative modeling settings and find that the same contrast, collapse versus avoidance of collapse, emerges in all three. Second, we study a compromise scenario: the available data remains the same as in _accumulate_, but, unlike _accumulate_ and like _replace_, each model is trained with a fixed compute budget. We demonstrate that test loss on real data is larger than in _accumulate_ but appears to plateau, unlike the divergence seen with _replace_.
Third, we study the relative importance of the cardinality and the proportion of real data for avoiding model collapse. Surprisingly, we find a non-trivial interaction between real and synthetic data: the value of synthetic data for reducing test loss depends on the absolute quantity of real data. Our insights are especially important for forecasting whether future frontier generative models will collapse or thrive, and our results open avenues for empirically and mathematically studying the context-dependent value of synthetic data.
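The three scenarios in the abstract can be made concrete with a small simulation. Below is a minimal, hypothetical sketch (not code from the paper): a 1-D Gaussian fit by maximum likelihood stands in for a deep generative model, and a fixed-size training subsample stands in for the compromise scenario's fixed compute budget, both of which are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

def fit(data):
    """'Train' a model: maximum-likelihood estimate of a 1-D Gaussian."""
    return data.mean(), data.std(ddof=0)

def sample(model, n):
    """Generate n synthetic points from a fitted model."""
    mu, sigma = model
    return rng.normal(mu, sigma, n)

n, generations = 100, 50
real = rng.normal(0.0, 1.0, n)  # "real" data drawn from N(0, 1)

# Replace: each model sees only synthetic data from its predecessor.
data = real
for _ in range(generations):
    model = fit(data)
    data = sample(model, n)
print("replace:    mu=%.3f sigma=%.3f" % model)  # sigma tends to contract

# Accumulate: each model sees real data plus all synthetic data so far.
pool = real.copy()
for _ in range(generations):
    model = fit(pool)
    pool = np.concatenate([pool, sample(model, n)])
print("accumulate: mu=%.3f sigma=%.3f" % model)

# Compromise: the pool accumulates, but each model trains on a fixed-size
# subsample (a hypothetical proxy for a fixed compute budget).
pool = real.copy()
for _ in range(generations):
    batch = rng.choice(pool, size=n, replace=False)
    model = fit(batch)
    pool = np.concatenate([pool, sample(model, n)])
print("compromise: mu=%.3f sigma=%.3f" % model)
```

Under these toy assumptions, the replace loop loses variance generation by generation while the accumulate loop stays anchored by the ever-present real data, matching the contrast the abstract describes.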
Primary Area: generative models
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 7867