Improving the Scaling Laws of Synthetic Data with Deliberate Practice

Published: 01 May 2025 · Last Modified: 18 Jun 2025 · ICML 2025 Oral · CC BY 4.0
TL;DR: Deliberate Practice for Synthetic Data Generation (DP) dynamically generates informative samples to improve scaling efficiency, reducing sample requirements and training iterations while achieving superior performance.
Abstract: Inspired by the principle of deliberate practice in human learning, we propose Deliberate Practice for Synthetic Data Generation (DP), a novel framework that improves sample efficiency through dynamic synthetic data generation. Prior work has shown that scaling synthetic data is inherently challenging, as naively adding new data leads to diminishing returns. To address this, pruning has been identified as a key mechanism for improving scaling, enabling models to focus on the most informative synthetic samples. Rather than generating a large dataset and pruning it afterward, DP efficiently approximates the direct generation of informative samples. We theoretically show how training on challenging, informative examples improves scaling laws and empirically validate that DP achieves better scaling performance with significantly fewer training samples and iterations. On ImageNet-100, DP generates 3.4x fewer samples and requires 6x fewer iterations, while on ImageNet-1k, it generates 8x fewer samples with a 30% reduction in iterations, all while achieving superior performance compared to prior work.
Lay Summary: Modern machine learning models often rely on vast amounts of labeled data, which can be costly or unavailable. A promising alternative is to generate synthetic data using powerful generative models. However, not all synthetic data points are equally useful for training. Our paper introduces a method inspired by how humans learn more effectively from challenging tasks (the principle of deliberate practice). Instead of generating a large synthetic dataset all at once, we iteratively generate only informative and challenging examples based on the model's own uncertainty. This process helps the model learn more efficiently. We show that this approach achieves better performance with less data, and we provide theoretical and empirical evidence supporting its benefits.
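To make the generate-select-train loop described above concrete, here is a minimal sketch of an uncertainty-driven synthetic-data loop. It is an illustration under stated assumptions, not the authors' released implementation: `toy_generator` stands in for a class-conditional diffusion model, the learner is a small placeholder network, and the entropy-based selection rule, candidate pool sizes, and training schedule are all hypothetical choices made for brevity.

```python
# Sketch of a "deliberate practice"-style loop: generate candidate synthetic samples,
# keep the ones the current learner is most uncertain about, train, and repeat.
import torch
import torch.nn as nn
import torch.nn.functional as F

NUM_CLASSES = 10


def toy_generator(labels: torch.Tensor) -> torch.Tensor:
    """Placeholder for a class-conditional diffusion model: returns random 32x32 RGB images."""
    return torch.randn(labels.shape[0], 3, 32, 32)


def predictive_entropy(logits: torch.Tensor) -> torch.Tensor:
    """Entropy of the learner's softmax distribution, one value per sample."""
    log_p = F.log_softmax(logits, dim=-1)
    return -(log_p.exp() * log_p).sum(dim=-1)


# Small stand-in learner; any classifier with the same interface would work here.
learner = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 256), nn.ReLU(),
                        nn.Linear(256, NUM_CLASSES))
optimizer = torch.optim.SGD(learner.parameters(), lr=0.01, momentum=0.9)

NUM_ROUNDS, POOL_SIZE, KEEP, STEPS = 5, 512, 128, 20  # hypothetical hyperparameters

for round_idx in range(NUM_ROUNDS):
    # 1) Generate a pool of candidate synthetic samples with their conditioning labels.
    labels = torch.randint(0, NUM_CLASSES, (POOL_SIZE,))
    images = toy_generator(labels)

    # 2) Score candidates with the current learner; keep the most uncertain (informative) ones.
    learner.eval()
    with torch.no_grad():
        scores = predictive_entropy(learner(images))
    keep_idx = scores.topk(KEEP).indices
    train_x, train_y = images[keep_idx], labels[keep_idx]

    # 3) Train on the selected samples, then repeat with the updated model's uncertainty.
    learner.train()
    for _ in range(STEPS):
        batch = torch.randint(0, KEEP, (64,))
        loss = F.cross_entropy(learner(train_x[batch]), train_y[batch])
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    print(f"round {round_idx}: loss on selected batch {loss.item():.3f}")
```

The key design point the sketch tries to convey is that selection happens before any large static dataset is materialized: each round only keeps the candidates the current model finds hardest, which is what reduces the number of generated samples and training iterations relative to generate-then-prune pipelines.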
Primary Area: Deep Learning->Generative Models and Autoencoders
Keywords: Synthetic Data, Deliberate Practice, Active Learning, Sample Efficiency, Scaling Laws, Data Curation, Diffusion Models, Dataset Pruning
Submission Number: 13125