Programming Every Example: Lifting Pre-training Data Quality like Experts at Scale

ICLR 2025 Conference Submission 3785 Authors

24 Sept 2024 (modified: 02 Dec 2024) | ICLR 2025 Conference Submission | CC BY 4.0
Keywords: Large Language Models, Pre-training, Data Refinement, Data Engineering
TL;DR: ProX uses small language models to refine large-scale pre-training data via program generation, significantly boosting pre-trained models' performance and efficiency across various benchmarks and model scales.
Abstract: Large language model pre-training has traditionally relied on human experts to craft heuristics for improving corpus quality, resulting in numerous rules developed to date. However, these rules lack the flexibility to address the unique characteristics of individual examples effectively. Meanwhile, applying tailored rules to every example is impractical for human experts. In this paper, we demonstrate that even small language models, with as few as 0.3B parameters, can exhibit substantial data refining capabilities comparable to those of human experts. We introduce Programming Every Example (ProX), a novel framework that treats data refinement as a programming task, enabling models to refine corpora by generating and executing fine-grained operations, such as string normalization, for each individual example at scale. Experimental results show that models pre-trained on ProX-curated data outperform those trained on either the original data or data curated via selection methods by more than 2% across 10 downstream benchmarks. Its effectiveness spans various model sizes (0.3B~1.7B) and pre-training corpora (C4, RedPajama-V2, and FineWeb). Furthermore, ProX shows great potential in domain-specific continual pre-training: models trained on OpenWebMath refined by ProX outperform human-crafted rule-based methods, improving accuracy by 7.6% on Mistral-7B, 14.6% on Llama-2-7B, and 20.3% on CodeLlama-7B within 10B tokens, comparable to Llemma-7B trained on 200B tokens. ProX significantly reduces training FLOPs, offering an efficient path for LLM pre-training.
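To make the "data refinement as a programming task" idea concrete, below is a minimal illustrative sketch in Python. It is an assumption-laden toy, not the authors' implementation: the operation names (`normalize_whitespace`, `drop_lines_matching`), the `propose_refinement_program` placeholder (standing in for the small ~0.3B refining model), and the JSON-like program format are all hypothetical; the actual ProX operation set and interface may differ.

```python
# Illustrative sketch of per-example refinement via generated programs.
# NOTE: the operation set, program format, and propose_refinement_program
# placeholder below are assumptions for illustration only.

import re
from typing import Callable, Dict, List


def normalize_whitespace(doc: str) -> str:
    """Collapse runs of spaces/tabs and strip trailing whitespace per line."""
    return "\n".join(re.sub(r"[ \t]+", " ", line).rstrip() for line in doc.splitlines())


def drop_lines_matching(doc: str, pattern: str) -> str:
    """Remove boilerplate lines (e.g., navigation menus) matching a regex."""
    return "\n".join(line for line in doc.splitlines() if not re.search(pattern, line))


# Hypothetical library of fine-grained refinement operations.
OPS: Dict[str, Callable[..., str]] = {
    "normalize_whitespace": normalize_whitespace,
    "drop_lines_matching": drop_lines_matching,
}


def propose_refinement_program(doc: str) -> List[dict]:
    """Placeholder for the small refining model: given one document, emit a
    short program (a list of operation calls) tailored to that document."""
    return [
        {"op": "drop_lines_matching", "args": {"pattern": r"^(Home|Login|Share)\b"}},
        {"op": "normalize_whitespace", "args": {}},
    ]


def refine(doc: str) -> str:
    """Execute the generated program over a single example."""
    for call in propose_refinement_program(doc):
        doc = OPS[call["op"]](doc, **call["args"])
    return doc


if __name__ == "__main__":
    raw = "Home | Login | Share\nDeep    learning   scales with   data.  \n"
    print(refine(raw))  # boilerplate line dropped, whitespace normalized
```

The design point this sketch tries to capture is that each document gets its own small, executable program rather than a single corpus-wide rule set, which is what lets the refinement adapt to per-example quirks at scale.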
Primary Area: foundation or frontier models, including LLMs
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 3785