Keywords: reinforcement learning with verifiable rewards, parameter efficient tuning
TL;DR: Optimizing just the first k tokens with a small RL-tuned adapter (“Prefix-RL”) or a Prefix Clustering approach steers a frozen LLM’s solution strategy, recovering much of full RL’s math gains at a tiny compute cost.
Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) is a leading approach for tuning language models on mathematical reasoning tasks. However, it remains unclear whether RLVR's gains stem from genuine reasoning improvements or simply from steering the model toward answer formats that already appear in the reference distribution. Inspired by recent evidence \citep{zhao2025echo,yue2025does}, we study this question by optimizing only the first $k$ tokens (e.g., $k=32$) of each solution and generating the remainder of the response from the reference model. We study two prefix-optimization methods: a simple algorithm that clusters candidate prefixes and selects the best one (Prefix Clustering), and a method that optimizes the prefix by fine-tuning a lightweight adapter model with RL (Prefix-RL). We show that tuning only the first $k$ tokens can significantly improve accuracy on math benchmarks, suggesting that at least some of the gains from RL are due to upweighting a preferable solution strategy. Our results suggest that simple prefix optimization methods can provide an efficient alternative to RL, delivering substantial improvements across different models and benchmarks for a tiny fraction of the compute required for standard RL.
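To make the setup concrete, below is a minimal sketch of the prefix-optimization idea: sample candidate $k$-token prefixes, complete each one with the frozen reference model, and keep the completion preferred by a verifiable reward. This is a simplified best-of-n style illustration, not the paper's Prefix Clustering or Prefix-RL algorithm; the model name, K, NUM_PREFIXES, and the `verify` reward function are assumptions for illustration only.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "Qwen/Qwen2.5-Math-1.5B"  # hypothetical frozen reference model
K = 32                 # number of prefix tokens being optimized
NUM_PREFIXES = 8       # candidate prefixes sampled per problem

tok = AutoTokenizer.from_pretrained(MODEL_NAME)
ref = AutoModelForCausalLM.from_pretrained(MODEL_NAME, torch_dtype=torch.bfloat16)
ref.eval()

def best_prefix_completion(problem: str, verify) -> str:
    """Sample K-token prefixes, complete each with the frozen reference model,
    and return the first completion whose answer passes the verifier
    (falling back to the last sampled completion otherwise)."""
    prompt_ids = tok(problem, return_tensors="pt").input_ids
    best = None
    with torch.no_grad():
        for _ in range(NUM_PREFIXES):
            # Sample only the first K tokens: this is the "prefix" under optimization.
            prefix_ids = ref.generate(
                prompt_ids, max_new_tokens=K, do_sample=True, temperature=1.0
            )
            # Generate the remainder of the solution from the frozen reference model.
            full_ids = ref.generate(prefix_ids, max_new_tokens=512, do_sample=False)
            completion = tok.decode(
                full_ids[0, prompt_ids.shape[1]:], skip_special_tokens=True
            )
            # `verify` is an assumed verifiable-reward check (e.g., exact answer match).
            if verify(completion, problem):
                return completion
            best = completion
    return best
```

In the paper's framing, Prefix Clustering would replace the naive per-problem sampling above with clustering of candidate prefixes and selection of the best cluster, while Prefix-RL would replace sampling with a small RL-tuned adapter that generates the prefix; the reference model stays frozen in both cases.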
Primary Area: reinforcement learning
Submission Number: 20975