Is Model Collapse Inevitable? Breaking the Curse of Recursion by Accumulating Real and Synthetic Data
Keywords: Model Collapse, Model-Data Feedback Loops, Generative Models, Language Models, Diffusion Models, Variational Autoencoders
Abstract: The proliferation of generative models, combined with pretraining on web-scale data, raises a timely question: what happens when these models are trained on their own generated outputs? Recent investigations into model-data feedback loops proposed that such loops would lead to a phenomenon termed _model collapse_, under which performance progressively degrades with each model-data feedback iteration until fitted models become useless. However, those studies largely assumed that new data replace old data over time, whereas an arguably more realistic assumption is that data accumulate over time. In this paper, we ask: what effect does accumulating data have on model collapse? We empirically study this question by pretraining sequences of deep generative models (language models, diffusion models, variational autoencoders) on different tasks (causal language modeling, molecular conformation generation, image generation). After confirming that replacing the original real data with each generation's synthetic data does indeed tend towards model collapse, we find that accumulating the successive generations of synthetic data alongside the original real data avoids model collapse. To understand why accumulating data can avoid model collapse, we use an analytically tractable framework of linear models introduced by prior work, which showed that replacing data causes the test error to diverge; we extend that analysis to prove that if data instead accumulate, the test error has a finite upper bound independent of the number of iterations, so model collapse no longer occurs. Our work provides consistent empirical and theoretical evidence that accumulating data, rather than discarding real data, avoids model collapse.
Submission Number: 4
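The abstract's theoretical argument contrasts two data regimes in a linear-regression feedback loop: replacing data each generation versus accumulating it. The sketch below is an illustrative simulation of that setup, not the paper's code; the dimension, sample size, noise level, and number of generations are arbitrary assumed values chosen only to make the diverging-versus-bounded behaviour visible.

```python
"""
Minimal sketch of a linear-model data feedback loop. Each generation fits
ordinary least squares, then produces synthetic labels from the fitted model
plus fresh noise. "replace" discards earlier data; "accumulate" keeps the
original real data and every generation of synthetic data.
"""
import numpy as np

rng = np.random.default_rng(0)
d, n, sigma, generations = 10, 200, 1.0, 50  # illustrative settings, not the paper's
w_true = rng.normal(size=d)


def fit_ols(X, y):
    # Ordinary least-squares fit of labels y on covariates X.
    return np.linalg.lstsq(X, y, rcond=None)[0]


def run(mode):
    # Generation 0: real data drawn from the true linear model.
    X_all = rng.normal(size=(n, d))
    y_all = X_all @ w_true + sigma * rng.normal(size=n)
    w_hat = fit_ols(X_all, y_all)
    errors = []
    for _ in range(generations):
        # Fresh covariates; labels come from the previously fitted model (synthetic data).
        X_new = rng.normal(size=(n, d))
        y_new = X_new @ w_hat + sigma * rng.normal(size=n)
        if mode == "replace":
            X_all, y_all = X_new, y_new                    # discard all older data
        else:  # "accumulate"
            X_all = np.vstack([X_all, X_new])              # keep real + all synthetic data
            y_all = np.concatenate([y_all, y_new])
        w_hat = fit_ols(X_all, y_all)
        # Parameter error serves as a proxy for test error with isotropic covariates.
        errors.append(float(np.sum((w_hat - w_true) ** 2)))
    return errors


for mode in ("replace", "accumulate"):
    err = run(mode)
    print(f"{mode:>10}: gen 1 error {err[0]:.3f}, gen {generations} error {err[-1]:.3f}")
```

Running this, the "replace" error grows with each generation while the "accumulate" error stays near its initial value, mirroring the divergent versus bounded test-error behaviour the abstract describes.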