TL;DR: forecasting whether future frontier generative models will collapse or thrive due to synthetic data
Abstract: What happens when generative machine learning models are pretrained on web-scale datasets containing data generated by earlier models? Some prior work warns of “model collapse” as the web is overwhelmed by synthetic data; other work suggests the problem can be contained (i.e. collapse can be avoided) by managing how available data are used in pretraining. In this paper, we report experiments on three ways of using data (training-workflows), across three generative model task-settings (multivariate Gaussian estimation, kernel density estimation, and language-model fine-tuning), that further confirm the possibility of containment: (a) we confirm that the training-workflow of {\it replacing} all real data with successive generations of purely synthetic data indeed causes model collapse in every task-setting studied; (b) we consider the training-workflow of {\it accumulating} synthetic data alongside real data and training on all data combined, and confirm that, although the proportion of real data eventually tends to zero, models remain stable and their test losses do not diverge under this training-workflow; (c) we consider a training-workflow in which real and synthetic data accumulate together but each successive generation of pretraining is constrained to a fixed-size subset of the accumulated data. Under this workflow, we observe slow, gradual degradation of test loss across generations rather than explosive divergence. Our insights are particularly important when forecasting whether future frontier generative models will collapse or thrive, and our results open avenues for empirically and mathematically studying the context-dependent value of synthetic data.
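The replace-vs-accumulate contrast described in the abstract can be sketched in the simplest task-setting, 1-D Gaussian estimation. This is a toy illustration, not the paper's experimental code (see the linked repository for that); the sample size and generation count here are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)
n_per_gen, generations = 100, 500

# Real data drawn from a standard normal.
real = rng.normal(0.0, 1.0, n_per_gen)

# Workflow (a): "replace" -- each generation fits only the previous
# generation's synthetic samples. The fitted standard deviation
# performs a downward-drifting random walk and collapses toward zero.
mu, sigma = real.mean(), real.std()
for _ in range(generations):
    synthetic = rng.normal(mu, sigma, n_per_gen)
    mu, sigma = synthetic.mean(), synthetic.std()
replace_sigma = sigma

# Workflow (b): "accumulate" -- synthetic data are pooled with all
# prior (real and synthetic) data before refitting. Later generations
# contribute a shrinking fraction of the pool, so the estimate stays
# close to the real distribution's parameters.
pool = real.copy()
mu, sigma = pool.mean(), pool.std()
for _ in range(generations):
    synthetic = rng.normal(mu, sigma, n_per_gen)
    pool = np.concatenate([pool, synthetic])
    mu, sigma = pool.mean(), pool.std()
accumulate_sigma = sigma
```

With these settings, `replace_sigma` shrinks far below the true value of 1.0 while `accumulate_sigma` stays near it, matching findings (a) and (b).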
Lay Summary: Some previous work has claimed that AI-generated content on the internet will cause future models trained on that data to degrade in quality, or exhibit "model collapse". We refute prior work that portrays model collapse as a major threat to future models by showing that, as long as models are trained on a mixture of real and synthetic data, model collapse is contained. Since no one is likely to delete human-generated data en masse from the internet, model collapse is unlikely to constrain future generations of AI models.
Application-Driven Machine Learning: This submission is on Application-Driven Machine Learning.
Link To Code: https://github.com/RylanSchaeffer/KoyejoLab-Collapse-or-Thrive
Primary Area: Deep Learning->Large Language Models
Keywords: model collapse, synthetic data, model-data feedback loops, data-model feedback loops, generative models, generative modeling, kernel density estimation, supervised finetuning
Submission Number: 2706