Abstract: The standard justification for Full Fine-Tuning (FFT) in Reinforcement Learning with Verifiable Rewards (RLVR) rests on a reasonable intuition: reasoning requires expressive weight updates that Low-Rank Adaptation (LoRA) cannot provide. We show this intuition identifies the wrong variable. Through a systematic rank sweep under GRPO, we document *rank collapse*: a discontinuous performance cliff in which increasing LoRA rank beyond a threshold causes catastrophic, irrecoverable policy failure, a phenomenon absent from the SFT literature. A batch-size ablation shows that this failure is not rescued by larger batches under the same one-epoch cold-start GRPO protocol: LoRA ranks $128$ and $256$ remain near floor across batch sizes $64$, $128$, and $256$, while rank $64$ itself falls from $73.1\%$ at batch size $64$ to $8.7\%$ and $6.0\%$ at batch sizes $128$ and $256$. This failure is not generic undertraining: LoRA $r=8$, DoRA $r=16$, and QuanTA $d=3$ remain trainable under the same larger-batch regimes. Spectral analysis suggests a mechanism: collapsed high-rank adapters concentrate update energy into a small number of singular directions, consistent with degenerate optimization rather than distributed reasoning improvement. FFT shows a milder version of the same spectral concentration pattern, achieving lower effective rank than structured adapters despite updating far more parameters. Expressivity alone is therefore not the bottleneck; the structure of the update manifold is. Structured adapters that constrain which high-rank solutions are reachable by gradient descent outperform LoRA and FFT on our primary DeepMath-Hard comparison and remain more robust under the larger-batch stress tests. Across three 8B base models, the relative behavior of low-rank and structured high-rank adapters also correlates with frozen-weight spectral structure and reported pre-training scale, a pattern we term the Model Maturity Hypothesis. We present this as a falsifiable hypothesis rather than a causal law: architecture, tokenizer, and data mixture remain confounded with pre-training scale in the current model set. The operative question for RLVR is not simply whether to use LoRA or FFT, but what structure to impose over the update manifold under a given model, task, and optimization budget.
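To make the spectral diagnostic concrete, the sketch below computes an entropy-based effective rank of a LoRA update $\Delta W = BA$. This is a minimal illustration, not the paper's reported implementation: the function name, the matrix shapes, and the choice of the Roy-Vetterli entropy definition of effective rank are all assumptions introduced here.

```python
import torch

def effective_rank(delta_w: torch.Tensor, eps: float = 1e-12) -> float:
    """Entropy-based effective rank (Roy & Vetterli, 2007) of a weight update.

    A collapsed adapter concentrates singular-value mass in a few directions,
    which drives this value far below the nominal LoRA rank.
    """
    s = torch.linalg.svdvals(delta_w)        # singular values, descending
    p = s / (s.sum() + eps)                  # normalize spectrum to a distribution
    entropy = -(p * (p + eps).log()).sum()   # Shannon entropy of the spectrum
    return float(entropy.exp())              # effective rank = exp(H)

# Example: a LoRA update delta_w = B @ A with nominal rank r = 128.
# (Shapes and initialization scale are illustrative, not taken from the paper.)
r, d_out, d_in = 128, 4096, 4096
B = torch.randn(d_out, r) * 0.01
A = torch.randn(r, d_in) * 0.01
print(effective_rank(B @ A))  # on the order of r for a random update; far below r if collapsed
```

Under this definition, a uniform spectrum over $k$ nonzero singular values gives an effective rank of exactly $k$, so the gap between nominal and effective rank directly measures the concentration the abstract describes.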
Submission Type: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Marlos_C._Machado1
Submission Number: 8776