Keywords: model collapse, framework, synthetic data
TL;DR: We provide a framework for studying model collapse in LLMs and use it to conduct the first evaluation of model collapse in large-scale continual pretraining.
Abstract: The abundance of content generated by Large Language Models (LLMs) at web scale creates a risk of "model collapse", a phenomenon in which recursive training on synthetic data degrades future generations of models. However, empirical studies of the existence and severity of model collapse remain fragmented due to a lack of standardized tooling, an inconsistent scope of empirical evaluation, and varying measures of collapse. To bridge this gap, we introduce $\texttt{collapsebench}$, a framework under ongoing development that is designed for the principled, reproducible study of model collapse at scale. The framework automates the full recursive pipeline of data curation, training, and generation, and supports various design choices from prior work through a unified configuration-as-code interface. We aim to provide a modular and customizable testbed with which the community can rigorously evaluate model collapse and propose and study new measures and mitigation strategies. We showcase the framework's functionality through supervised finetuning experiments. Using $\texttt{collapsebench}$, we contribute the first evaluation of the severity of model collapse in a realistic setting of continual pretraining (CPT). In particular, we observe a consistent drop in accuracy during iterative training in high-synthetic-data regimes. We also explore the impact of realistic synthetic data curation and observe that it partially dampens the effect of model collapse.
Submission Number: 68