Abstract: Large language model pre-training has traditionally relied on human experts to craft heuristics for improving corpus quality, resulting in numerous rules developed to date. However, these fixed rules lack the flexibility to address the unique characteristics of individual examples, and crafting sample-wise rules by hand is impractical for human experts. In this paper, we show that even small language models, with as few as 0.3B parameters, can exhibit substantial data-refining capability. We propose Programming Every Example (ProX), a novel framework that treats data refinement as a programming task and enables a model to refine corpora by generating and executing fine-grained operations, such as string normalization, for each individual example at scale. Experiments show that models trained on ProX-refined data consistently outperform baselines across 10 benchmarks, demonstrating effectiveness across model sizes (up to 1.7B parameters) and pre-training corpora (C4, RedPajama-V2, FineWeb, FineWeb-Edu, and DCLM).
ProX also shows great potential in continual pre-training: in the math domain, ProX boosts 7B models by up to 20% within 10B tokens, a result typically achieved only with much larger-scale training (e.g., 200B tokens).
We believe ProX offers a promising way to curate high-quality pre-training data, ultimately contributing to more efficient LLM development.
Lay Summary: Large language models (LLMs) are trained on trillions of words from the web, but much of that data is noisy, duplicated, or meaningless junk. Existing data-cleaning pipelines rely on hundreds of manually crafted rules designed for entire datasets, not for the quirks of each individual example. Writing tailored rules for billions of samples is infeasible for human curators.
Our work, Programming Every Example (ProX), reframes data cleaning as code generation. A lightweight 0.3B-parameter language model writes and executes small, targeted programs—such as string normalization or HTML stripping—for each record. This fine-grained approach allows ProX to preserve valuable content that traditional filters would discard and remove subtle flaws that older heuristics overlook.
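To make this concrete, here is a minimal conceptual sketch, in Python, of what a per-example refinement loop of this kind might look like. The operation set and the `refiner_lm.generate_program` interface are illustrative assumptions for exposition, not the released ProX implementation (see the repository linked below for the actual code).

```python
# Conceptual sketch of a "program every example" refinement loop.
# The operation names and `refiner_lm.generate_program` are hypothetical,
# used only to illustrate the idea of executing a small, model-generated
# program against each individual document.
import re

def strip_html(text: str) -> str:
    """Remove simple HTML tags from a document."""
    return re.sub(r"<[^>]+>", "", text)

def normalize_whitespace(text: str) -> str:
    """Collapse repeated whitespace and trim the document."""
    return re.sub(r"\s+", " ", text).strip()

# Registry of fine-grained operations the generated program may call.
OPERATIONS = {
    "strip_html": strip_html,
    "normalize_whitespace": normalize_whitespace,
    "drop_document": lambda text: None,  # discard a document judged as junk
}

def refine_example(example: str, refiner_lm) -> str | None:
    """Ask a small LM to emit a short program (a sequence of operation
    names) tailored to this one example, then execute it."""
    # e.g. program == ["strip_html", "normalize_whitespace"]
    program = refiner_lm.generate_program(example)
    for op_name in program:
        example = OPERATIONS[op_name](example)
        if example is None:  # the program chose to drop this document
            return None
    return example
```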
We applied ProX to five major corpora: C4, RedPajama-V2, FineWeb, FineWeb-Edu, and DCLM. The resulting datasets are leaner and cleaner, leading to consistent improvements on ten diverse benchmarks across models up to 1.7B parameters. In math-heavy continual pre-training, ProX boosted 7B-parameter models by up to 20% using just 10B additional tokens, gains that typically require around 200B tokens. By reducing data waste, ProX offers a path toward faster, cheaper, and more sustainable language model development.
Link To Code: https://github.com/GAIR-NLP/ProX
Primary Area: Deep Learning->Large Language Models
Keywords: Large Language Models, Pre-training, Data Refinement, Data Engineering
Submission Number: 10003