Is Model Collapse Inevitable? Breaking the Curse of Recursion by Accumulating Real and Synthetic Data

Published: 10 Jul 2024 · Last Modified: 26 Aug 2024 · COLM · CC BY-NC-SA 4.0
Research Area: Data, Societal implications, Science of LMs, LMs and the world
Keywords: model collapse, curse of recursion, generative models, model-data feedback loops
TL;DR: Accumulating Data Prevents Model Collapse
Abstract: The proliferation of generative models, combined with pretraining on web-scale data, raises a timely question: what happens when future models are trained on model-generated data? Recent investigations concluded that such model-data feedback loops cause performance to degrade progressively with each model-data iteration until fitted models become useless, a phenomenon termed model collapse. However, those studies largely assumed that new data replace old data over time, whereas a more realistic assumption is that data accumulate over time. In this paper, we ask: what effect does accumulating data have on model collapse? We first empirically study this question by pretraining sequences of language models on text corpora. After confirming that replacing the original real data with each generation's synthetic data does indeed tend towards model collapse, we demonstrate that accumulating synthetic data alongside real data avoids model collapse; these results hold across a range of model sizes, architectures, and hyperparameters. We obtain similar results for other deep generative models: diffusion models for molecule conformation generation and variational autoencoders for image generation. To understand why accumulating data can avoid model collapse, we use an analytically tractable framework introduced by prior work in which a sequence of linear models is fit to previous models' outputs. Previous work used this framework to show that if data are replaced, the test error increases with the number of model-fitting iterations; we extend this argument to prove that if data instead accumulate, the test error has a finite upper bound independent of the number of iterations, meaning model collapse is avoided. Our work provides consistent empirical and theoretical evidence that data accumulation avoids model collapse.
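As a rough intuition for the replace-versus-accumulate contrast described in the abstract, the following Python sketch iteratively refits a linear regression on labels generated by the previous model. This is an illustrative simulation under assumed settings (dimensions, noise level, and generation counts are hypothetical), not the paper's exact analytical framework or experimental setup; it only mirrors the qualitative claim that test error grows when synthetic data replace real data but stays bounded when data accumulate.

import numpy as np

# Hypothetical parameters for illustration only.
rng = np.random.default_rng(0)
d, n_per_gen, n_gens, noise = 10, 200, 20, 0.5
w_star = rng.normal(size=d)                      # ground-truth weights

X_test = rng.normal(size=(5000, d))
y_test = X_test @ w_star                         # noiseless test targets

def fit(X, y):
    # Ordinary least squares.
    return np.linalg.lstsq(X, y, rcond=None)[0]

def test_error(w):
    return np.mean((X_test @ w - y_test) ** 2)

# Generation 0: real data with label noise.
X0 = rng.normal(size=(n_per_gen, d))
y0 = X0 @ w_star + noise * rng.normal(size=n_per_gen)

w_replace = w_accumulate = fit(X0, y0)
X_acc, y_acc = X0, y0

for gen in range(1, n_gens + 1):
    X_new = rng.normal(size=(n_per_gen, d))
    # Synthetic labels come from the previous generation's model, with fresh noise.
    y_rep = X_new @ w_replace + noise * rng.normal(size=n_per_gen)
    y_new = X_new @ w_accumulate + noise * rng.normal(size=n_per_gen)

    # Replace: train only on the newest synthetic data.
    w_replace = fit(X_new, y_rep)

    # Accumulate: train on the real data plus all synthetic data so far.
    X_acc = np.vstack([X_acc, X_new])
    y_acc = np.concatenate([y_acc, y_new])
    w_accumulate = fit(X_acc, y_acc)

    print(f"gen {gen:2d}  replace={test_error(w_replace):.3f}  "
          f"accumulate={test_error(w_accumulate):.3f}")

Running this typically shows the "replace" error drifting upward across generations while the "accumulate" error stays near the generation-0 level, in line with the bounded-error result the abstract describes.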
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the COLM Code of Ethics on https://colmweb.org/CoE.html
Author Guide: I certify that this submission complies with the submission instructions as described on https://colmweb.org/AuthorGuide.html
Submission Number: 555