Not All Code Helps: Disentangling the Impact of Code Data on Mathematical Reasoning in Large Language Models
Keywords: Code data, large language models, cross-domain data impact
TL;DR: Explore how code data affects large language model performance
Abstract: Incorporating code into training corpora has become a widely acknowledged practice in the development of modern foundation language models (LMs). Compared with a general Internet corpus, code offers high-quality, well-structured signals that substantially augment the coding proficiency of models. Beyond programming skills, prior research has suggested that code data may also contribute to non-coding capabilities. Nevertheless, through a series of rigorous controlled experiments, we demonstrate that the influence of code on other domains—particularly reasoning—remains limited.
Our principal findings are as follows: (1) A code corpus yields substantial gains in programming-related abilities but only marginal improvements in non-coding tasks. We further observe that code competes with knowledge-intensive tasks. (2) Not all code data enhances mathematical reasoning ability. We identify a core subset that functions as cognitive scaffolding for mathematical reasoning, especially for complex problem-solving scenarios. (3) Formal reasoning (e.g., code reasoning or program-of-thought approaches) provides more pronounced improvements in challenging mathematical reasoning tasks, while natural language–based reasoning proves more effective for simpler reasoning problems.
Finally, by probing the internal mechanisms of LMs, we reveal how training data modulates routing patterns, thereby shaping emergent model behavior. Because training data is a central driver of model capability, our findings disentangle domain-specific data into finer-grained, cross-domain ability dimensions and underscore promising directions for future data optimization.
Primary Area: foundation or frontier models, including LLMs
Submission Number: 10928