Keywords: High-dimensional Linear Regression, SGD, Scaling Laws, Synthetic Data
Abstract: Synthetic data has become a promising way to scale model training beyond limited human-generated data, but it may also induce strong model collapse (Dohmatob et al., 2024), in which any fixed fraction of synthetic data prevents model performance from improving under data scaling and leaves a non-vanishing excess-risk floor. In this paper, we study how synthetic data affects the generalization of one-pass SGD in high-dimensional linear regression with model shift. We show that mixed training induces strong model collapse, whereas two-stage training, which uses synthetic data only in the first stage and real data in the second, avoids it. Strong model collapse is therefore not inevitable and can be averted by a simple data curriculum. We further establish scaling-law upper bounds for both protocols under a random sketch model; these bounds show that larger models amplify synthetic-induced degradation in mixed training, and they characterize explicitly how high-quality synthetic training may reduce bias in two-stage training. Overall, our results highlight that synthetic data is neither inherently harmful nor inherently beneficial; its effect depends critically on both its quality and the training protocol used to incorporate it.
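The two protocols named in the abstract can be illustrated with a minimal toy sketch. It is not the paper's setup: it assumes isotropic Gaussian features, a constant step size, and a synthetic teacher equal to the real one plus a unit-norm shift, and it omits the random sketch model entirely. All names and parameter values (`sgd_pass`, `run_demo`, `frac_syn`, `lr`, and so on) are hypothetical and chosen only for illustration.

```python
import numpy as np


def sgd_pass(w, is_synthetic, w_real, w_syn, lr, noise_std, rng):
    """One-pass SGD on squared loss: each step draws one fresh sample.

    is_synthetic[t] decides whether the label at step t comes from the
    synthetic teacher w_syn (True) or the real teacher w_real (False).
    """
    w = w.copy()
    d = w.shape[0]
    for use_syn in is_synthetic:
        x = rng.standard_normal(d)  # assumed isotropic features
        teacher = w_syn if use_syn else w_real
        y = x @ teacher + noise_std * rng.standard_normal()
        w -= lr * (x @ w - y) * x  # stochastic gradient step
    return w


def excess_risk(w, w_real):
    """Excess risk on real data; equals ||w - w_real||^2 for identity covariance."""
    return float(np.sum((w - w_real) ** 2))


def run_demo(seed=0, d=20, n=20_000, frac_syn=0.5, lr=0.01, noise_std=0.1):
    rng = np.random.default_rng(seed)
    w_real = rng.standard_normal(d) / np.sqrt(d)
    shift = rng.standard_normal(d)
    shift /= np.linalg.norm(shift)  # model shift of unit norm (assumption)
    w_syn = w_real + shift
    w0 = np.zeros(d)

    # Mixed training: every step is synthetic with probability frac_syn.
    mixed_mask = rng.random(n) < frac_syn
    # Two-stage training: synthetic samples first, real samples afterwards.
    staged_mask = np.arange(n) < int(frac_syn * n)

    w_mixed = sgd_pass(w0, mixed_mask, w_real, w_syn, lr, noise_std, rng)
    w_staged = sgd_pass(w0, staged_mask, w_real, w_syn, lr, noise_std, rng)
    return excess_risk(w_mixed, w_real), excess_risk(w_staged, w_real)


if __name__ == "__main__":
    mixed, staged = run_demo()
    print(f"mixed: {mixed:.4f}  two-stage: {staged:.4f}")
```

In this toy setting the mixed run keeps being pulled toward a blend of the two teachers, while the two-stage run ends with real-data steps only. That matches the qualitative contrast the abstract describes, but the sketch does not reproduce the paper's bounds or scaling laws.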
Submission Number: 37