Keywords: RLVR, LLM reasoning
Abstract: Reinforcement learning with verifiable rewards (RLVR) has substantially improved the reasoning capabilities of large language models.
While existing analyses identify that RLVR-induced changes are sparse, they primarily focus on the **magnitude** of these updates, largely overlooking their **direction**.
In this work, we argue that the direction of updates is a more critical lens for understanding RLVR's effects, which can be captured by the signed, token-level log probability difference $\Delta\log p$ between the base and final RLVR models.
Through statistical analysis and token-replacement interventions, we demonstrate that $\Delta\log p$ identifies sparse yet reasoning-critical updates more effectively than magnitude-based metrics (e.g., divergence or entropy).
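Concretely, for a token $x_t$ in context $x_{<t}$, one natural formalization (the sign convention here is our assumption; it is not fixed by the text above) is
$$
\Delta\log p(x_t) \;=\; \log \pi_{\mathrm{RLVR}}(x_t \mid x_{<t}) \;-\; \log \pi_{\mathrm{base}}(x_t \mid x_{<t}),
$$
so that positive values mark tokens the RLVR model has come to prefer over the base model, and negative values mark tokens it has suppressed.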
Building on this insight, we propose two practical applications:
(1) a *test-time extrapolation* method that amplifies the policy along the learned $\Delta\log p$ direction to improve reasoning accuracy without further training (a decoding-time sketch follows this list);
(2) a *training-time reweighting* method that focuses learning on low-probability tokens (those corresponding to higher $\Delta\log p$), which improves reasoning performance across models and benchmarks (a loss-level sketch follows the abstract).
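As an illustration of application (1), the sketch below amplifies the RLVR policy along the base-to-RLVR direction in log-probability space, i.e., $\log p_{\mathrm{extrap}} \propto \log p_{\mathrm{RLVR}} + \alpha\,(\log p_{\mathrm{RLVR}} - \log p_{\mathrm{base}})$. The linear form, the greedy decoder, and the checkpoint names `BASE_ID`/`RLVR_ID` are assumptions chosen for illustration, not necessarily the paper's exact formulation.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical checkpoint names; substitute the actual base and RLVR models.
BASE_ID = "org/base-model"
RLVR_ID = "org/rlvr-model"

tokenizer = AutoTokenizer.from_pretrained(RLVR_ID)
base = AutoModelForCausalLM.from_pretrained(BASE_ID).eval()
rlvr = AutoModelForCausalLM.from_pretrained(RLVR_ID).eval()

@torch.no_grad()
def extrapolated_next_token_logprobs(input_ids, alpha=0.5):
    """Amplify along the base->RLVR direction in log-prob space:
    log p_extrap is proportional to log p_rlvr + alpha * (log p_rlvr - log p_base).
    alpha = 0 recovers the RLVR model; larger alpha extrapolates further.
    """
    logp_base = base(input_ids).logits[:, -1].log_softmax(-1)
    logp_rlvr = rlvr(input_ids).logits[:, -1].log_softmax(-1)
    scores = logp_rlvr + alpha * (logp_rlvr - logp_base)
    return scores.log_softmax(-1)  # renormalize into a valid distribution

def greedy_generate(prompt, max_new_tokens=64, alpha=0.5):
    """Greedy decoding from the extrapolated distribution."""
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    for _ in range(max_new_tokens):
        next_id = extrapolated_next_token_logprobs(ids, alpha).argmax(-1, keepdim=True)
        ids = torch.cat([ids, next_id], dim=-1)
        if next_id.item() == tokenizer.eos_token_id:
            break
    return tokenizer.decode(ids[0], skip_special_tokens=True)
```

Setting `alpha = 0` reduces to standard decoding from the RLVR model; increasing it moves further along the learned direction at test time, with no additional training.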
Our work establishes the direction of change as a key principle for analyzing and improving RLVR.
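For application (2), a minimal sketch of a token-reweighted policy-gradient loss is given below. The specific weight $w_t = (1 - p_t)^{\beta}$, computed under the current policy, is an assumption made for illustration; the paper's actual weighting (and how it ties to $\Delta\log p$) may differ.

```python
import torch
import torch.nn.functional as F

def low_prob_reweighted_pg_loss(logits, actions, advantages, beta=1.0):
    """Token-reweighted policy-gradient surrogate (sketch).

    logits:     [B, T, V] policy logits over the sampled responses
    actions:    [B, T]    sampled token ids
    advantages: [B, T]    per-token (or broadcast per-sequence) advantages

    Each token's REINFORCE term is scaled by a weight that grows as the
    token's probability under the current policy shrinks; w_t = (1 - p_t)^beta
    is one simple choice, used here purely as an illustration.
    """
    logp = F.log_softmax(logits, dim=-1)
    token_logp = logp.gather(-1, actions.unsqueeze(-1)).squeeze(-1)  # [B, T]
    with torch.no_grad():
        weights = (1.0 - token_logp.exp()).clamp(min=0.0) ** beta    # [B, T]
    # Standard policy-gradient term, reweighted token by token.
    return -(weights * advantages * token_logp).mean()
```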
Primary Area: reinforcement learning
Submission Number: 16664