Self-Verification Provably Prevents Model Collapse in Recursive Synthetic Training

Published: 18 Sept 2025 · Last Modified: 29 Oct 2025 · NeurIPS 2025 poster · CC BY 4.0
Keywords: Model Collapse, Learning Theory, Synthetic Data, Self-Verification, LLMs
TL;DR: This paper proves that self-verification prevents model collapse in recursive training without relying on real data.
Abstract: Large generative models are increasingly trained on synthetic data from earlier generations, raising concerns about *model collapse*, a progressive performance decline consistently observed in empirical studies. However, theoretical understanding of recursive training dynamics and their failure modes remains limited. In this work, we theoretically show that recursive training inherently leads to exponential error growth unless mitigated by sufficient real data. Addressing the growing scarcity of real data, we introduce a self-verification mechanism enabling models to filter their outputs based on internal confidence scores without external validation. Through rigorous analysis, we derive finite-sample error bounds demonstrating that self-verification alone can prevent collapse, even in fully synthetic training regimes. Our theoretical framework extends to large language models (LLMs), characterizing the conditions under which recursive training can maintain stability without performance degradation.
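Illustrative sketch: the following toy example is not the paper's algorithm or bounds; it is a minimal sketch of the mechanism described in the abstract, under assumed names and numbers (`corrupt_frac`, `log_conf_threshold`, Gaussian toy model). Each generation is fit only on synthetic data from the previous generation, and "self-verification" keeps only samples to which the current model itself assigns high likelihood (an internal confidence score), with no external validation or real data.

```python
# Toy sketch of recursive synthetic training with self-verification.
# Not the paper's method; all constants and function names are illustrative.
import numpy as np

rng = np.random.default_rng(0)

def generate(mean, std, n, corrupt_frac=0.1):
    """Sample from the fitted model; a small fraction of outputs are grossly
    corrupted, standing in for generation errors that drive collapse."""
    x = rng.normal(mean, std, size=n)
    bad = rng.random(n) < corrupt_frac
    x[bad] += rng.normal(50.0, 5.0, size=int(bad.sum()))
    return x

def self_verify(x, mean, std, log_conf_threshold=-9.0):
    """Internal confidence filter: keep only samples whose log-likelihood under
    the model itself exceeds a threshold; no external validator, no real data."""
    log_conf = -0.5 * ((x - mean) / std) ** 2 - np.log(std * np.sqrt(2.0 * np.pi))
    return x[log_conf > log_conf_threshold]

def recursive_training(generations=20, n=5000, verify=True):
    mean, std = 0.0, 1.0  # generation-0 model, assumed fit on real data
    for _ in range(generations):
        synthetic = generate(mean, std, n)
        if verify:
            synthetic = self_verify(synthetic, mean, std)
        # The next generation is trained on synthetic data only.
        mean, std = synthetic.mean(), synthetic.std()
    return mean, std

print("no verification  :", recursive_training(verify=False))
print("self-verification:", recursive_training(verify=True))
```

In this sketch, the unfiltered run drifts and its error compounds across generations, while the self-verified run stays near the original distribution; the paper's contribution is the formal finite-sample analysis of when such confidence-based filtering suffices, not this toy simulation.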
Primary Area: Theory (e.g., control theory, learning theory, algorithmic game theory)
Submission Number: 12388