Structure Over Scale: Rethinking Adaptation for Reinforcement Learning with Verifiable Rewards

Allan Kazakov; Abdurrahman Javat

Published: 26 May 2026, Last Modified: 03 Jun 2026

Venue: ICML 2026 FoGen Workshop Poster

Readers: Everyone

License: CC BY 4.0

Keywords: Reinforcement Learning with Verifiable Rewards, Parameter-Efficient Fine-Tuning, Structured Adaptation, Spectral Analysis

Abstract: The standard justification for Full Fine-Tuning (FFT) in Reinforcement Learning with Verifiable Rewards (RLVR) rests on a reasonable intuition: reasoning requires expressive weight updates that Low-Rank Adaptation (LoRA) cannot provide. We show this intuition identifies the wrong variable. Through a systematic rank sweep under GRPO, we document rank collapse: a discontinuous performance cliff where increasing LoRA rank beyond a threshold causes catastrophic, irrecoverable policy failure, a phenomenon absent from the SFT literature. A batch-size ablation shows that this failure is not rescued by larger batches under the same one-epoch cold-start GRPO protocol: LoRA ranks 128 and 256 remain near floor across batch sizes 64, 128, and 256, while rank 64 itself falls from 73.1% at batch size 64 to 8.7% and 6.0% at batch sizes 128 and 256. This failure is not generic undertraining: LoRA r=8, DoRA r=16, and QuanTA d=3 remain trainable under the same larger-batch regimes. Spectral analysis suggests a mechanism: collapsed high-rank adapters concentrate update energy into a small number of singular directions, consistent with degenerate optimization rather than distributed reasoning improvement. FFT shows a milder version of the same spectral concentration pattern, achieving lower effective rank than structured adapters despite updating far more parameters. Expressivity alone is therefore not the bottleneck; the structure of the update manifold is. Structured adapters that constrain which high-rank solutions are reachable by gradient descent outperform LoRA and FFT on our primary DeepMath-Hard comparison and remain more robust under the larger-batch stress tests. Across three 8B base models, the relative behavior of low-rank and structured high-rank adapters also correlates with frozen-weight spectral structure and reported pre-training scale, a pattern we term the Model Maturity Hypothesis. We present this as a falsifiable hypothesis rather than a causal law: architecture, tokenizer, and data mixture remain confounded with pre-training scale in the current model set. The operative question for RLVR is not simply whether to use LoRA or FFT, but what structure to impose over the update manifold under a given model, task, and optimization budget.
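
The spectral quantities mentioned in the abstract (update energy concentrated in a few singular directions, and effective rank) can be illustrated with a minimal NumPy sketch. The abstract does not state the paper's exact definitions, so this is an assumption: effective rank is taken here as the exponential of the Shannon entropy of the normalized squared singular values, and `spectral_summary`, `top_k`, and the synthetic matrices are hypothetical names and data used only for illustration.

```python
import numpy as np


def spectral_summary(delta_w, top_k=4):
    """Summarize how a weight update's energy is spread over singular directions.

    Assumed definitions (not taken from the paper):
      - energy of a direction = squared singular value
      - effective rank = exp(entropy of the normalized energy distribution)
    """
    s = np.linalg.svd(delta_w, compute_uv=False)  # singular values, descending
    energy = s ** 2
    total = energy.sum()
    if total == 0.0:
        return {"effective_rank": 0.0, "top_k_energy": 0.0}
    p = energy / total
    p_nz = p[p > 0]  # drop exact zeros so log() stays finite
    entropy = -(p_nz * np.log(p_nz)).sum()
    return {
        "effective_rank": float(np.exp(entropy)),
        "top_k_energy": float(p[:top_k].sum()),
    }


rng = np.random.default_rng(0)

# Synthetic rank-64 update with energy spread over many directions
# (a LoRA-style product B @ A with random factors).
b = rng.standard_normal((256, 64))
a = rng.standard_normal((64, 256))
spread_update = b @ a

# Same singular vectors, but singular values forced to decay geometrically,
# mimicking an update whose energy sits in a handful of directions.
u, s, vt = np.linalg.svd(spread_update, full_matrices=False)
decayed = s[0] * 0.5 ** np.arange(s.shape[0])
concentrated_update = (u * decayed) @ vt

spread_stats = spectral_summary(spread_update)
concentrated_stats = spectral_summary(concentrated_update)
```

Both synthetic updates have the same nominal rank, yet the concentrated one has a much lower effective rank and places nearly all of its energy in the top few directions. That is the kind of contrast the abstract describes between collapsed high-rank adapters and structured adapters, though the real measurements would be taken on trained adapter weights rather than random matrices.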

Submission Number: 53
