Keywords: Reinforcement Learning with Verifiable Rewards, Parameter-Efficient Fine-Tuning, Structured Adaptation, Spectral Analysis
Abstract: The standard justification for Full Fine-Tuning (FFT) in Reinforcement Learning
with Verifiable Rewards (RLVR) rests on a reasonable intuition: reasoning
requires expressive weight updates that Low-Rank Adaptation (LoRA) cannot
provide. We show this intuition identifies the wrong variable. Through a systematic rank sweep under GRPO, we document *rank collapse*—a discontinuous performance cliff where increasing LoRA rank beyond a threshold
causes catastrophic, irrecoverable policy failure even at moderate batch sizes, a phenomenon absent from the
SFT literature. Spectral analysis reveals the mechanism: in a sparse binary reward landscape,
unconstrained high-rank adapters allow the optimizer to satisfy rewards through
degenerate solutions, bypassing coherent reasoning entirely. FFT exhibits the same pathology in milder form—achieving *lower* effective
rank in its learned weight updates than structured adapters using less than 0.6%
of the parameters. Expressivity is not the bottleneck; structure is. Structured adapters that constrain *which* high-rank solutions are reachable by gradient descent consistently outperform both LoRA and FFT, and do so more
sharply as base-model pre-training scale increases—a pattern we term the
*Model Maturity Hypothesis*, supported by behavioral replication across
three architecturally independent models and by spectral signatures in frozen
base weights that predict adaptation behavior before training begins. The operative question for RLVR is not whether to use LoRA or FFT, but what structure to impose over the update manifold.
Submission Number: 15