Rewriting Pre-Training Data Boosts LLM Performance in Math and Code

08 May 2025 (modified: 30 Oct 2025) · Submitted to NeurIPS 2025 Datasets and Benchmarks Track · CC BY 4.0
Keywords: dataset, pretraining, code, math, machine learning
TL;DR: SwallowCode and SwallowMath are new datasets that enhance LLM performance in Python code generation and math reasoning by rewriting public data. Using a four-stage pipeline, SwallowCode boosts HumanEval pass@1 by +17.0, outperforming existing code datasets.
Abstract: The performance of large language models (LLMs) in program synthesis and mathematical reasoning is fundamentally limited by the quality of their pre-training corpora. We introduce two openly licensed datasets, released under the Llama 3.3 Community License, that significantly enhance LLM performance by systematically rewriting public data. SwallowCode ($\approx$16.1 billion tokens) refines Python snippets from The-Stack-v2 through a novel four-stage pipeline: syntax validation, pylint-based style filtering, and a two-stage LLM rewriting process that enforces style conformity and transforms snippets into self-contained, algorithmically efficient examples. Unlike prior methods that rely on exclusionary filtering or limited transformations, our transform-and-retain approach upgrades low-quality code, maximizing data utility. SwallowMath ($\approx$2.3 billion tokens) enhances Finemath-4+ by removing boilerplate, restoring context, and reformatting solutions into concise, step-by-step explanations. Within a fixed 50 billion token training budget, continual pre-training of Llama-3.1-8B with SwallowCode boosts pass@1 by \textbf{+17.0} on HumanEval and \textbf{+17.7} on HumanEval+ compared to Stack-Edu, surpassing the baseline model's code generation capabilities. Similarly, substituting SwallowMath yields \textbf{+12.4} accuracy on GSM8K and \textbf{+7.6} on MATH. Ablation studies confirm that each pipeline stage contributes incrementally, with rewriting delivering the largest gains. All datasets, prompts, and checkpoints are publicly available, enabling reproducible research and advancing LLM pre-training for specialized domains.
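To make the pipeline description concrete, below is a minimal sketch of the first two SwallowCode stages named in the abstract: syntax validation and pylint-based style filtering. The score threshold and helper names are illustrative assumptions, not the released configuration; the actual prompts and scripts are in the linked repository.

```python
# Minimal sketch (assumptions: threshold value, helper names) of SwallowCode's
# first two filtering stages; the two LLM rewriting stages are not shown.
import ast
import os
import tempfile
from io import StringIO

from pylint.lint import Run
from pylint.reporters.text import TextReporter


def passes_syntax_check(code: str) -> bool:
    """Stage 1: keep only snippets that parse as valid Python."""
    try:
        ast.parse(code)
        return True
    except SyntaxError:
        return False


def pylint_score(code: str) -> float:
    """Stage 2: score a snippet with pylint (10.0 is a perfect score)."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    try:
        results = Run([path], reporter=TextReporter(StringIO()), exit=False)
        return float(results.linter.stats.global_note or 0.0)
    finally:
        os.unlink(path)


def keep_snippet(code: str, threshold: float = 7.0) -> bool:
    """Filter applied before the two LLM rewriting stages.

    The 7.0 threshold is a placeholder; the paper's ablations determine
    the actual cutoff.
    """
    return passes_syntax_check(code) and pylint_score(code) >= threshold


if __name__ == "__main__":
    snippet = "def add(a, b):\n    return a + b\n"
    print(keep_snippet(snippet))
```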
Croissant File: zip
Primary Area: Datasets & Benchmarks for applications in language modeling and vision language modeling
Dataset URL: https://github.com/rioyokotalab/swallow-code-math
Code URL: https://github.com/rioyokotalab/swallow-code-math
Submission Number: 799