Keywords: RLVR, Reasoning
TL;DR: We provide the first parameter-space account of RLVR’s training dynamics
Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) reliably improves the reasoning performance of large language models, yet it appears to modify only a small fraction of parameters.
We revisit this paradox and show that the apparent sparsity is a surface artifact of a model-conditioned optimization bias: for a fixed pretrained model, updates localize to preferred parameter regions that are highly consistent across runs and largely invariant to datasets and RL recipes.
We mechanistically explain these dynamics with a Three-Gate Theory:
Gate I (KL anchor) imposes a KL-constrained update;
Gate II (model geometry) steers the step off principal directions into low-curvature, spectrum-preserving subspaces; and
Gate III (precision) hides micro-updates in non-preferred regions, making the off-principal bias appear as sparsity.
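To make Gate III concrete, here is a minimal sketch (our own illustration of the general rounding mechanism, not code from this submission) showing how bf16 storage can absorb micro-updates: entries that receive only tiny updates remain bit-identical after the cast back to bf16, so the off-principal bias reads as sparsity.

```python
# Illustrative sketch (assumption, not the submission's code): micro-updates far
# below the local bf16 rounding resolution are absorbed when weights are stored
# back in bf16, leaving many entries bit-identical.
import torch

torch.manual_seed(0)
w = torch.randn(4096, dtype=torch.bfloat16)       # weights stored in bf16
update = 1e-4 * torch.randn(4096)                 # tiny RL-style update (fp32)

w_new = (w.float() + update).to(torch.bfloat16)   # apply update, cast back to bf16
changed = (w_new != w).float().mean().item()

# bf16 keeps ~8 bits of effective mantissa (relative resolution ~2**-8), so
# updates much smaller than that, relative to each weight's magnitude, are
# rounded away and most entries stay unchanged.
print(f"fraction of entries actually changed: {changed:.3f}")
```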
We then validate this theory and provide a parameter-level characterization of RLVR’s learning dynamics: RLVR learns off principal directions in weight space, exhibiting minimal spectral drift, substantially smaller principal-subspace rotation than SFT, and update directions aligned away from the principal subspace, whereas SFT targets principal weights and distorts the spectrum.
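As one illustration of how such a parameter-level characterization can be computed (a sketch under our own naming and assumptions, not the authors’ released tooling), the SVD-based diagnostics below quantify spectral drift, top-k principal-subspace rotation, and how much of the update’s energy lies inside the original principal subspace.

```python
# Hypothetical SVD-based diagnostics for the quantities named in the abstract:
# spectral drift, principal-subspace rotation, and principal-subspace alignment
# of the update. A sketch, not the paper's official measurement code.
import torch

def rlvr_diagnostics(w_before: torch.Tensor, w_after: torch.Tensor, k: int = 16):
    """Compare one weight matrix before and after fine-tuning."""
    U0, S0, _ = torch.linalg.svd(w_before.float(), full_matrices=False)
    U1, S1, _ = torch.linalg.svd(w_after.float(), full_matrices=False)

    # Spectral drift: relative change of the singular-value spectrum.
    spectral_drift = (S1 - S0).norm() / S0.norm()

    # Principal-subspace rotation: 1 minus the normalized overlap of the
    # top-k left singular subspaces (0 = identical subspace, 1 = orthogonal).
    overlap = (U0[:, :k].T @ U1[:, :k]).pow(2).sum() / k
    rotation = 1.0 - overlap

    # Update alignment: fraction of the update's energy that falls inside the
    # top-k principal subspace of the original weights (low = off-principal).
    dw = (w_after - w_before).float()
    proj = U0[:, :k] @ (U0[:, :k].T @ dw)
    principal_energy = proj.pow(2).sum() / dw.pow(2).sum()

    return spectral_drift.item(), rotation.item(), principal_energy.item()
```

Under this reading, the abstract’s claim corresponds to RLVR checkpoints showing small spectral drift, small rotation, and a low principal-energy fraction relative to SFT checkpoints of the same pretrained matrix.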
Together, these results provide the first parameter-space account of RLVR’s training dynamics, revealing clear regularities in how parameters evolve.
Crucially, we show that RL operates in a distinct optimization regime from SFT, so directly porting SFT-era parameter-efficient fine-tuning (PEFT) methods to RLVR can be ill-founded, as evidenced by our case studies on advanced sparse fine-tuning and LoRA variants.
We hope this work charts a path toward a white-box understanding of RLVR and the design of geometry-aware, RLVR-native learning algorithms, rather than repurposed SFT-era heuristics.
Submission Number: 214