Rewriting Pre-Training Data Boosts LLM Performance in Math and Code

08 May 2025 (modified: 30 Oct 2025) · Submitted to NeurIPS 2025 Datasets and Benchmarks Track · CC BY 4.0
Keywords: dataset, pretraining, code, math, machine learning
TL;DR: SwallowCode and SwallowMath are new datasets that enhance LLM performance in Python code generation and math reasoning by rewriting public data. Using a four-stage pipeline, SwallowCode boosts HumanEval pass@1 by +17.0, outperforming existing code datasets.
Abstract: The performance of large language models (LLMs) in program synthesis and mathematical reasoning is fundamentally limited by the quality of their pre-training corpora. We introduce two openly licensed datasets, released under the Llama 3.3 Community License, that significantly enhance LLM performance by systematically rewriting public data. SwallowCode ($\approx$16.1 billion tokens) refines Python snippets from The-Stack-v2 through a novel four-stage pipeline: syntax validation, pylint-based style filtering, and a two-stage LLM rewriting process that enforces style conformity and transforms snippets into self-contained, algorithmically efficient examples. Unlike prior methods that rely on exclusionary filtering or limited transformations, our transform-and-retain approach upgrades low-quality code, maximizing data utility. SwallowMath ($\approx$2.3 billion tokens) enhances Finemath-4+ by removing boilerplate, restoring context, and reformatting solutions into concise, step-by-step explanations. Within a fixed 50 billion token training budget, continual pre-training of Llama-3.1-8B with SwallowCode boosts pass@1 by \textbf{+17.0} on HumanEval and \textbf{+17.7} on HumanEval+ compared to Stack-Edu, surpassing the baseline model's code generation capabilities. Similarly, substituting SwallowMath yields \textbf{+12.4} accuracy on GSM8K and \textbf{+7.6} on MATH. Ablation studies confirm that each pipeline stage contributes incrementally, with rewriting delivering the largest gains. All datasets, prompts, and checkpoints are publicly available, enabling reproducible research and advancing LLM pre-training for specialized domains.
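To make the pipeline description concrete, below is a minimal sketch of the first two SwallowCode stages named in the abstract: syntax validation and pylint-based style filtering. The score threshold and helper names are illustrative assumptions, not the released configuration; the actual prompts and scripts are in the linked repository.

```python
# Minimal sketch (assumptions: threshold value, helper names) of SwallowCode's
# first two filtering stages; the two LLM rewriting stages are not shown.
import ast
import os
import tempfile
from io import StringIO

from pylint.lint import Run
from pylint.reporters.text import TextReporter


def passes_syntax_check(code: str) -> bool:
    """Stage 1: keep only snippets that parse as valid Python."""
    try:
        ast.parse(code)
        return True
    except SyntaxError:
        return False


def pylint_score(code: str) -> float:
    """Stage 2: score a snippet with pylint (10.0 is a perfect score)."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    try:
        results = Run([path], reporter=TextReporter(StringIO()), exit=False)
        return float(results.linter.stats.global_note or 0.0)
    finally:
        os.unlink(path)


def keep_snippet(code: str, threshold: float = 7.0) -> bool:
    """Filter applied before the two LLM rewriting stages.

    The 7.0 threshold is a placeholder; the paper's ablations determine
    the actual cutoff.
    """
    return passes_syntax_check(code) and pylint_score(code) >= threshold


if __name__ == "__main__":
    snippet = "def add(a, b):\n    return a + b\n"
    print(keep_snippet(snippet))
```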
Croissant File: zip
Primary Area: Datasets & Benchmarks for applications in language modeling and vision language modeling
Dataset URL: https://github.com/rioyokotalab/swallow-code-math
Code URL: https://github.com/rioyokotalab/swallow-code-math
Submission Number: 799