Why RL Updates Look Sparse: An Implicit Compass Drives Optimization Bias

12 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: RLVR, Reasoning
TL;DR: We are the first to identify, and provide a mechanistic analysis of, the optimization-bias phenomenon in RLVR.
Abstract: Reinforcement learning (RL) reliably improves LLM reasoning while appearing to change only a small fraction of parameters. We revisit this paradox and argue that the visible sparsity is not the phenomenon itself but the trace of a persistent optimization bias, in which RLVR stubbornly commits updates to preferred regions that remain invariant across datasets and RL variants, as if guided by an implicit compass. We propose a Three-Gate Theory to formalize this mechanism: Gate I (Anchor) shows that RL induces a one-step policy-KL leash that keeps updates proximal to the base policy; this constrained update is then steered by Gate II (Model Geometry) toward lower-curvature, spectrum-preserving directions, a data-invariant feature; finally, it is filtered by Gate III (Precision), where the bfloat16 format acts as a lens that amplifies the bias by hiding micro-updates, making the underlying pattern visible as apparent sparsity. Empirically, we validate this theory with a comprehensive suite of experiments. We show that RL preserves the model's spectral structure and avoids its principal weights, in sharp contrast to SFT, which alters spectra and mainly targets those weights. Causal interventions confirm that this bias is destroyed when the model's geometry is disrupted, proving that the geometry is the steering core of the "compass." By providing the first parameter-level account of RLVR's training dynamics, our work not only demystifies its optimization bias but also offers a new perspective for understanding RLVR. Crucially, we show that RL operates in a distinct optimization regime from SFT, so directly adapting SFT-era parameter-efficient fine-tuning (PEFT) methods can be flawed, as evidenced by our case studies on advanced sparse fine-tuning and LoRA variants; this motivates the design of efficient, geometry-aware, RLVR-native learning algorithms.
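The Gate III claim, that bfloat16 storage hides micro-updates and thereby renders the bias visible as sparsity, can be illustrated with a minimal sketch. This is not the paper's experimental setup; it simply simulates bfloat16 by truncating a float32 to its upper 16 bits (truncation rather than round-to-nearest-even, which is enough to show the effect):

```python
import struct

def to_bf16(x: float) -> float:
    """Reduce a float to bfloat16 precision by keeping only the upper
    16 bits of its float32 bit pattern (a truncation-based simulation)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]

w = to_bf16(1.0)        # a weight stored in bfloat16
micro_update = 1e-4     # far below bf16's ~2**-8 spacing near 1.0
w_new = to_bf16(w + micro_update)
print(w_new == w)       # True: the micro-update is absorbed by rounding
```

Because the update is smaller than the bfloat16 spacing near the weight's value, the stored weight does not change at all, so a parameter-diff of the checkpoint reports it as untouched even though the optimizer did push on it.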
Supplementary Material: zip
Primary Area: foundation or frontier models, including LLMs
Submission Number: 4566