Does Math Reasoning Improve General LLM Capabilities? Understanding Transferability of LLM Reasoning

Published: 02 Mar 2026, Last Modified: 09 Mar 2026 · ICLR 2026 Workshop ICBINB · CC BY 4.0
Keywords: Large Language Models, LLM reasoning, Transferability
Abstract: Math reasoning has become the poster child of progress in large language models (LLMs), with new models rapidly surpassing human-level performance on benchmarks like MATH and AIME. But as math leaderboards improve week by week, it is worth asking: do these gains reflect broader problem-solving ability or just narrow overfitting? To answer this question, we evaluate over 20 open-weight reasoning-tuned models across a broad suite of tasks, including math, scientific QA, agent planning, coding, and standard instruction-following. Surprisingly, we find that most models that succeed at math fail to transfer their gains to other domains. To rigorously study this phenomenon, we conduct controlled experiments using math-only data with two widely used methods, Reinforcement Learning (RL) and Supervised Fine-tuning (SFT), along with detailed ablations. Beyond the observation that RL-tuned models transfer better than SFT-tuned models, we identify on-policy fine-tuning as the key mechanism underlying cross-domain transfer, regardless of whether the training signal comes from RL or supervised learning. Latent-space representation and token-space distribution shift analyses reveal that off-policy SFT induces substantial representation and output drift, while on-policy RL preserves general-domain structure. Our results suggest a need to rethink standard post-training recipes, particularly their reliance on off-policy SFT-distilled data, when advancing reasoning models.
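To make the token-space distribution shift analysis concrete, here is a minimal sketch of one way to quantify output drift: average the per-token KL divergence between a tuned model and its base model on general-domain text. The checkpoint names and probe sentence are hypothetical placeholders, and the two models are assumed to share a tokenizer; this is not the paper's exact measurement protocol.

```python
# Sketch: per-token KL(tuned || base) on a general-domain probe sentence.
# Higher values indicate larger token-space distribution drift after tuning.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE_ID = "Qwen/Qwen2.5-7B"            # hypothetical base checkpoint
TUNED_ID = "Qwen/Qwen2.5-7B-Instruct"  # hypothetical reasoning-tuned checkpoint

tokenizer = AutoTokenizer.from_pretrained(BASE_ID)  # assumed shared tokenizer
base = AutoModelForCausalLM.from_pretrained(BASE_ID, torch_dtype=torch.bfloat16)
tuned = AutoModelForCausalLM.from_pretrained(TUNED_ID, torch_dtype=torch.bfloat16)

@torch.no_grad()
def mean_token_kl(text: str) -> float:
    """Average KL(tuned || base) over next-token distributions of `text`."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    logp_base = F.log_softmax(base(ids).logits, dim=-1)
    logp_tuned = F.log_softmax(tuned(ids).logits, dim=-1)
    # KL at each position, summed over the vocabulary, averaged over positions.
    kl = (logp_tuned.exp() * (logp_tuned - logp_base)).sum(-1)
    return kl.mean().item()

print(mean_token_kl("The capital of France is Paris, a city known for its museums."))
```

Averaging this statistic over a held-out general-domain corpus (rather than a single sentence) would give a drift estimate in the spirit of the abstract's claim that off-policy SFT shifts output distributions more than on-policy RL.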
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 86