Programming Every Example: Lifting Pre-training Data Quality like Experts at Scale

ICLR 2025 Conference Submission 3785 Authors

24 Sept 2024 (modified: 02 Dec 2024) | ICLR 2025 Conference Submission | CC BY 4.0
Keywords: Large Language Models, Pre-training, Data Refinement, Data Engineering
TL;DR: ProX uses small language models to refine large-scale pre-training data via program generation, significantly boosting pre-trained models' performance and efficiency across various benchmarks and model scales.
Abstract: Large language model pre-training has traditionally relied on human experts to craft heuristics for improving corpus quality, resulting in numerous rules developed to date. However, these rules lack the flexibility to address the unique characteristics of individual examples effectively. Meanwhile, applying tailored rules to every example is impractical for human experts. In this paper, we demonstrate that even small language models, with as few as 0.3B parameters, can exhibit substantial data refining capabilities comparable to those of human experts. We introduce Programming Every Example (ProX), a novel framework that treats data refinement as a programming task, enabling models to refine corpora by generating and executing fine-grained operations, such as string normalization, for each individual example at scale. Experimental results show that models pre-trained on ProX-curated data outperform those trained on either the original data or data curated via selection methods by more than 2% across 10 downstream benchmarks. Its effectiveness spans various model sizes (0.3B~1.7B) and pre-training corpora (C4, RedPajama-V2, and FineWeb). Furthermore, ProX shows great potential in domain-specific continual pre-training: models trained on OpenWebMath refined by ProX outperform human-crafted rule-based methods, improving accuracy by 7.6% on Mistral-7B, 14.6% on Llama-2-7B, and 20.3% on CodeLlama-7B within 10B tokens, comparable to Llemma-7B trained on 200B tokens. ProX significantly reduces training FLOPs, offering an efficient path for LLM pre-training.
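To make the "data refinement as a programming task" idea concrete, below is a minimal illustrative sketch in Python. It is an assumption-laden toy, not the authors' implementation: the operation names (`normalize_whitespace`, `drop_lines_matching`), the `propose_refinement_program` placeholder (standing in for the small ~0.3B refining model), and the JSON-like program format are all hypothetical; the actual ProX operation set and interface may differ.

```python
# Illustrative sketch of per-example refinement via generated programs.
# NOTE: the operation set, program format, and propose_refinement_program
# placeholder below are assumptions for illustration only.

import re
from typing import Callable, Dict, List


def normalize_whitespace(doc: str) -> str:
    """Collapse runs of spaces/tabs and strip trailing whitespace per line."""
    return "\n".join(re.sub(r"[ \t]+", " ", line).rstrip() for line in doc.splitlines())


def drop_lines_matching(doc: str, pattern: str) -> str:
    """Remove boilerplate lines (e.g., navigation menus) matching a regex."""
    return "\n".join(line for line in doc.splitlines() if not re.search(pattern, line))


# Hypothetical library of fine-grained refinement operations.
OPS: Dict[str, Callable[..., str]] = {
    "normalize_whitespace": normalize_whitespace,
    "drop_lines_matching": drop_lines_matching,
}


def propose_refinement_program(doc: str) -> List[dict]:
    """Placeholder for the small refining model: given one document, emit a
    short program (a list of operation calls) tailored to that document."""
    return [
        {"op": "drop_lines_matching", "args": {"pattern": r"^(Home|Login|Share)\b"}},
        {"op": "normalize_whitespace", "args": {}},
    ]


def refine(doc: str) -> str:
    """Execute the generated program over a single example."""
    for call in propose_refinement_program(doc):
        doc = OPS[call["op"]](doc, **call["args"])
    return doc


if __name__ == "__main__":
    raw = "Home | Login | Share\nDeep    learning   scales with   data.  \n"
    print(refine(raw))  # boilerplate line dropped, whitespace normalized
```

The design point this sketch tries to capture is that each document gets its own small, executable program rather than a single corpus-wide rule set, which is what lets the refinement adapt to per-example quirks at scale.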
Primary Area: foundation or frontier models, including LLMs
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 3785