Structure Over Scale: Rethinking Adaptation for Reinforcement Learning with Verifiable Rewards

30 Apr 2026 (modified: 09 May 2026) · ICML 2026 Workshop CoLoRAI Submission · CC BY 4.0
Keywords: Reinforcement Learning with Verifiable Rewards, Parameter-Efficient Fine-Tuning, Structured Adaptation, Spectral Analysis
Abstract: The standard justification for Full Fine-Tuning (FFT) in Reinforcement Learning with Verifiable Rewards (RLVR) rests on a reasonable intuition: reasoning requires expressive weight updates that Low-Rank Adaptation (LoRA) cannot provide. We show this intuition identifies the wrong variable. Through a systematic rank sweep under GRPO, we document *rank collapse*—a discontinuous performance cliff where increasing LoRA rank beyond a threshold causes catastrophic, irrecoverable policy failure on moderate batch sizes, a phenomenon absent from the SFT literature. Spectral analysis reveals the mechanism: in a sparse binary reward landscape, unconstrained high-rank adapters allow the optimizer to satisfy rewards through degenerate solutions, bypassing coherent reasoning entirely. FFT exhibits the same pathology in milder form—achieving *lower* effective rank in its learned weight updates than structured adapters using less than 0.6% of the parameters. Expressivity is not the bottleneck; structure is. Structured adapters that constrain *which* high-rank solutions are reachable by gradient descent consistently outperform both LoRA and FFT, and do so more sharply as base-model pre-training scale increases—a pattern we term the *Model Maturity Hypothesis*, supported by behavioral replication across three architecturally independent models and by spectral signatures in frozen base weights that predict adaptation behavior before training begins. The operative question for RLVR is not whether to use LoRA or FFT, but what structure to impose over the update manifold.
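The abstract compares the "effective rank" of FFT's learned weight updates against structured adapters. One common definition of effective rank is the entropy-based measure of Roy & Vetterli (2007): normalize the singular values into a distribution and exponentiate its entropy. The sketch below is illustrative (the function name and matrix dimensions are not from the paper); it shows how a dense update can be scored against a low-rank one:

```python
import numpy as np

def effective_rank(delta_w: np.ndarray, eps: float = 1e-12) -> float:
    """Entropy-based effective rank (Roy & Vetterli, 2007):
    exp of the Shannon entropy of the normalized singular values."""
    s = np.linalg.svd(delta_w, compute_uv=False)
    p = s / (s.sum() + eps)          # singular values as a distribution
    p = p[p > eps]                   # drop numerically-zero mass
    return float(np.exp(-(p * np.log(p)).sum()))

rng = np.random.default_rng(0)
# A LoRA-style rank-8 update A @ B has effective rank at most 8 ...
low_rank_update = rng.normal(size=(256, 8)) @ rng.normal(size=(8, 256))
# ... while an unconstrained dense update spreads mass across many directions.
dense_update = rng.normal(size=(256, 256))
print(effective_rank(low_rank_update))  # <= 8
print(effective_rank(dense_update))     # much larger
```

Under this measure, the paper's claim is that FFT's updates concentrate spectral mass in few directions (low effective rank) despite having full-rank capacity, whereas structured adapters spread it more broadly.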
Submission Number: 15