Keywords: Large Language Model, Pretraining, Synthetic Data
TL;DR: We propose RePro, a novel web recycling method that trains a language model with RL to perform effective and faithful rephrasing. It outperforms the state-of-the-art recycling method, which uses a 17× larger model, and improves organic data efficiency by 2-3×.
Abstract: High-quality data is a cornerstone of large language model (LLM) pretraining, yet its growth has not kept pace with the needs of frontier models. In this paper, we introduce RePro, a novel web recycling method that trains a relatively small LM with reinforcement learning to generate effective and faithful rephrasings of pretraining data. Specifically, we design one *quality* reward and three *faithfulness* rewards, optimizing the LM rephraser to convert organic data into high-quality rephrasings while preserving its core semantics and structure. In our experiments, we train a 4B rephraser to recycle 72B tokens sampled from DCLM-RefinedWeb. Pretraining results on 400M and 1.4B models demonstrate that RePro delivers 4.7\%-14.0\% relative accuracy gains over the organic-only baseline on 22 downstream tasks. RePro also outperforms ReWire, the state-of-the-art web recycling method that prompts a 70B rephraser, as well as the organic baseline with a 4$\times$ larger data pool. Experiments with varying amounts of recycled data show that RePro improves organic data efficiency by 2-3$\times$. Individual and distributional analyses confirm that RePro preserves more critical information and more faithfully reflects the characteristics of organic data than prompting-based methods. Together, these results show that RePro provides an efficient and controllable path to harnessing organic pretraining data effectively. Our anonymized code is available at https://anonymous.4open.science/r/RePro. We will open-source our rephraser and recycled data.
Primary Area: foundation or frontier models, including LLMs
Submission Number: 14210