Keywords: reasoning, mathematical reasoning, LLM, RLVR, distillation
Abstract: Recent studies have shown that reinforcement learning with verifiable rewards (RLVR) improves the overall accuracy (pass@1) of LLMs on mathematical reasoning but often fails to improve their capability (pass@$k$), whereas distillation can improve both.
In this paper, we investigate the mechanisms behind these phenomena.
First, we demonstrate that RLVR struggles to improve capability because it raises accuracy on less-difficult questions at the expense of accuracy on the most difficult ones.
Second, through experiments distilling teacher responses to in-distribution problems, we find that distillation does not always improve capability.
We conjecture that capability improves only when new knowledge is introduced, whereas distilling reasoning patterns improves accuracy but not capability, sacrificing performance on the most difficult questions, much like RLVR.
Together, these findings offer a clearer understanding of how RLVR and distillation shape reasoning behavior in language models.
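The abstract does not specify how pass@$k$ is estimated; a minimal sketch, assuming the standard unbiased estimator of Chen et al. (2021) commonly used for this metric (the sample counts in the example are illustrative, not from the paper):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k samples,
    drawn without replacement from n generated samples of which c are correct,
    solves the problem."""
    if n - c < k:
        return 1.0  # fewer than k incorrect samples, so any k-subset contains a correct one
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical example: 16 samples per problem, 4 correct
print(pass_at_k(16, 4, 1))  # 0.25   (pass@1, i.e., accuracy)
print(pass_at_k(16, 4, 8))  # ~0.962 (pass@8, i.e., capability)
```

Under this reading, pass@1 averages per-sample correctness (accuracy), while pass@$k$ for larger $k$ credits a problem if any of $k$ attempts succeeds (capability), which is why the two can move independently.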
Submission Number: 119