Surrogate Signals from Format and Length: Reinforcement Learning for Solving Mathematical Problems without Ground Truth Answers
Track: long paper (up to 10 pages)
Keywords: Reinforcement Learning from Verifiable Rewards, Large Language Models, Reinforcement Learning, GRPO, without ground truth, label free
TL;DR: With surrogate rewards based only on format and length, RL can approximate ground-truth-based optimization for LLM mathematical reasoning, suggesting that RL mainly activates latent reasoning abilities rather than teaching new knowledge.
Abstract: Large Language Models (LLMs) have achieved remarkable success in natural language processing tasks, with Reinforcement Learning (RL) playing a key role in adapting them to specific applications. In mathematical problem solving, however, the reliance on ground truth answers poses significant challenges due to their high collection cost and limited availability.
This work explores the use of two simple surrogate signals, format and length, to guide RL training. We find that early training is dominated by format learning, where structural feedback alone accounts for most of the performance gains. Incorporating length-based rewards further refines outputs by discouraging overly long or overly short responses, enabling a GRPO approach with format-length signals to reach more than 90\% of the performance of ground-truth-based optimization, and in some cases to surpass it. For example, our method achieves 33.3\% accuracy on AIME2024 and 57.6\% on CRUX-O with a 7B base model, and generalizes across different model sizes and series.
Beyond practical efficiency, these findings offer a fresh perspective on RL: rather than imparting new knowledge, RL primarily activates reasoning capabilities already embedded in pre-trained models. This insight suggests that lightweight, label-efficient strategies can complement pre-training to unlock LLMs’ latent potential in reasoning-intensive tasks.
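To make the format-length idea concrete, the following is a minimal sketch of what a label-free surrogate reward and a GRPO-style group normalization might look like. It is not the submission's actual implementation: the `\boxed{...}` format rule, the piecewise-linear length shaping, the whitespace token count, the weight `w_len`, and all function names and default values are assumptions chosen purely for illustration.

```python
import re
import statistics

# Assumed format rule: the response contains exactly one \boxed{...} final answer.
BOXED = re.compile(r"\\boxed\{[^{}]+\}")


def format_reward(response: str) -> float:
    # 1.0 when the (assumed) format rule is satisfied, 0.0 otherwise.
    return 1.0 if len(BOXED.findall(response)) == 1 else 0.0


def length_reward(n_tokens: int, target: int = 400, max_len: int = 1000) -> float:
    # Hypothetical shaping: peaks at 1.0 when n_tokens == target and falls
    # linearly to 0.0 at both 0 and max_len, so that overly short and overly
    # long responses are both discouraged.
    if n_tokens <= 0 or n_tokens >= max_len:
        return 0.0
    if n_tokens <= target:
        return n_tokens / target
    return (max_len - n_tokens) / (max_len - target)


def surrogate_reward(response: str, w_len: float = 0.5) -> float:
    # Whitespace splitting is a crude stand-in for a real tokenizer.
    n_tokens = len(response.split())
    fmt = format_reward(response)
    # The length term only counts once the format is correct (an assumption).
    return fmt * (1.0 + w_len * length_reward(n_tokens))


def group_advantages(rewards: list[float]) -> list[float]:
    # GRPO-style normalization: standardize each reward within its sampled group.
    # Population standard deviation is used here; this choice is an assumption.
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0.0:
        return [0.0] * len(rewards)
    return [(r - mean) / std for r in rewards]
```

No ground-truth answer appears anywhere in this sketch: the reward depends only on the structure and length of each sampled response, and `group_advantages` then compares responses to the same prompt against each other.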
Anonymization: This submission has been anonymized for double-blind review via the removal of identifying information such as names, affiliations, and identifying URLs.
Funding: No, the presenting author of this submission does *not* fall under ICLR’s funding aims, or has sufficient alternate funding.
Submission Number: 10