Keywords: Reinforcement Learning with Verifiable Rewards, Parameter-Efficient Fine-Tuning, Structured Adaptation, Spectral Analysis
Abstract: The standard justification for Full Fine-Tuning (FFT) in Reinforcement Learning
with Verifiable Rewards (RLVR) rests on a reasonable intuition: reasoning
requires expressive weight updates that Low-Rank Adaptation (LoRA) cannot
provide.
We show this intuition identifies the wrong variable.
Through a systematic rank sweep under GRPO, we document rank collapse, a
discontinuous performance cliff where increasing LoRA rank beyond a threshold
causes catastrophic, irrecoverable policy failure, a phenomenon absent from the
SFT literature.
A batch-size ablation shows that this failure is not rescued by larger batches
under the same one-epoch cold-start GRPO protocol: LoRA ranks 128 and 256
remain near floor across batch sizes 64, 128, and 256, while rank 64
itself falls from 73.1% at batch size 64 to 8.7% and 6.0% at batch
sizes 128 and 256, respectively.
This failure is not generic undertraining: LoRA r=8, DoRA r=16, and QuanTA
d=3 remain trainable under the same larger-batch regimes.
Spectral analysis suggests a mechanism: collapsed high-rank adapters concentrate update energy into a small number of singular directions, consistent with degenerate optimization rather than distributed reasoning improvement.
FFT shows a milder version of the same spectral concentration pattern, achieving lower effective rank than structured adapters despite updating far more parameters.
Expressivity alone is therefore not the bottleneck; the structure of the update
manifold is. Structured adapters that constrain which high-rank solutions are
reachable by gradient descent outperform LoRA and FFT on our primary
DeepMath-Hard comparison and remain more robust under the larger-batch stress
tests. Across three 8B base models, the relative behavior of low-rank and
structured high-rank adapters also correlates with frozen-weight spectral
structure and reported pre-training scale, a pattern we term the Model Maturity
Hypothesis. We present this as a falsifiable hypothesis rather than a causal law:
architecture, tokenizer, and data mixture remain confounded with pre-training
scale in the current model set. The operative question for RLVR is not simply
whether to use LoRA or FFT, but what structure to impose over the update
manifold under a given model, task, and optimization budget.
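
To make the spectral quantities above concrete, the following is a minimal sketch of two common diagnostics for a weight update: an entropy-based effective rank and the share of update energy carried by the top-k singular directions. The abstract does not state which definition of effective rank is used, so the entropy-based form here is an assumption, and the function names `effective_rank` and `top_k_energy` are hypothetical.

```python
import math


def effective_rank(singular_values):
    """Entropy-based effective rank (an assumed definition, not the paper's).

    Normalizes the singular values into a distribution and returns
    exp(Shannon entropy). Equal singular values give the full count;
    a single dominant value gives a result close to 1.
    """
    s = [x for x in singular_values if x > 0]
    total = sum(s)
    if total == 0:
        return 0.0
    p = [x / total for x in s]
    entropy = -sum(q * math.log(q) for q in p)
    return math.exp(entropy)


def top_k_energy(singular_values, k):
    """Fraction of squared singular-value mass in the k largest directions."""
    sq = sorted((x * x for x in singular_values), reverse=True)
    total = sum(sq)
    if total == 0:
        return 0.0
    return sum(sq[:k]) / total


# Illustrative spectra only (made-up numbers, not results from the paper).
spread = [1.0, 1.0, 1.0, 1.0]        # energy distributed evenly
concentrated = [5.0, 0.0, 0.0, 0.0]  # energy in one direction

print(effective_rank(spread), top_k_energy(spread, 2))
print(effective_rank(concentrated), top_k_energy(concentrated, 1))
```

In practice the singular values would come from the update matrix itself, for example `numpy.linalg.svd(delta_W, compute_uv=False)` applied to the difference between tuned and frozen weights. A low effective rank together with a high top-k energy share would be one way to observe the concentration pattern described for collapsed high-rank adapters.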
Submission Number: 53