Keywords: reasoning, mathematical reasoning, LLM, RLVR, distillation
Abstract: Recent studies have shown that reinforcement learning with verifiable rewards (RLVR) improves the overall accuracy (pass@1) of LLMs on mathematical reasoning but often fails to improve their capability (pass@$k$), whereas distillation can improve both.
In this paper, we investigate the mechanisms behind these phenomena.
First, we demonstrate that RLVR struggles to improve capability because it raises accuracy on less-difficult questions at the expense of accuracy on the most difficult ones.
Second, through experiments distilling teacher responses to in-distribution problems, we find that distillation does not always improve capability.
We conjecture that capability improves only when new knowledge is introduced, whereas distilling reasoning patterns improves accuracy but not capability, sacrificing performance on the most difficult questions, much like RLVR.
Together, these findings offer a clearer understanding of how RLVR and distillation shape reasoning behavior in language models.
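The abstract does not specify how pass@$k$ is estimated; a minimal sketch, assuming the standard unbiased estimator of Chen et al. (2021) commonly used for this metric (the sample counts in the example are illustrative, not from the paper):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k samples,
    drawn without replacement from n generated samples of which c are correct,
    solves the problem."""
    if n - c < k:
        return 1.0  # fewer than k incorrect samples, so any k-subset contains a correct one
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical example: 16 samples per problem, 4 correct
print(pass_at_k(16, 4, 1))  # 0.25   (pass@1, i.e., accuracy)
print(pass_at_k(16, 4, 8))  # ~0.962 (pass@8, i.e., capability)
```

Under this reading, pass@1 averages per-sample correctness (accuracy), while pass@$k$ for larger $k$ credits a problem if any of $k$ attempts succeeds (capability), which is why the two can move independently.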
Submission Number: 119