The Digital Inbreeding Crisis: Empirical Evidence of LLM Capability Degradation under Multi-Generational Synthetic Training
Keywords: Large Language Models, Synthetic Data, Model Collapse, Capability Degradation, AI Safety
TL;DR: An investigation into the potential deterioration of language model capabilities when training new models exclusively on outputs from existing LLMs, examining the parallel to biological inbreeding depression.
Abstract: This paper provides the first comprehensive empirical validation of the ``digital inbreeding'' hypothesis—measurable capability degradation when LLMs are trained iteratively on synthetic data. Through systematic experimental analysis across three generations and multiple evaluation domains, we demonstrate 4.54\% F1 decline in mixed training conditions versus 3.43\% improvement in controls using exclusively human data. Our multi-dimensional analysis reveals semantic coherence decline (-6.05\%), structural simplification (-17.8\% sentence length reduction), and compensatory diversification (+34.3\% distinct n-gram increase). These findings establish quantifiable evidence for model collapse effects in production scenarios, providing actionable guidelines for training data curation and sustainable AI development.
Supplementary Material: zip
Submission Number: 257
Loading