Surrogate Signals from Format and Length: Reinforcement Learning for Solving Mathematical Problems without Ground Truth Answers
Abstract: Large Language Models (LLMs) have achieved remarkable success in natural language processing tasks, with Reinforcement Learning (RL) playing a key role in adapting them to specific applications. However, obtaining ground truth answers for training LLMs in mathematical problem solving is often challenging, costly, and sometimes infeasible. This research investigates the use of format and length as surrogate signals for training LLMs on mathematical problems, bypassing the need for traditional ground truth answers.
Our study shows that a reward function centered on format correctness alone can yield performance improvements comparable to the standard Group Relative Policy Optimization (GRPO) algorithm in the early phase of training. Recognizing the limitations of format-only rewards, we incorporate length-based rewards. The resulting GRPO approach, leveraging format-length surrogate signals, not only matches but in certain scenarios surpasses the standard GRPO algorithm that relies on ground truth answers, achieving 40.0% accuracy on AIME2024 with a 7B base model.
Through systematic exploration and experimentation, this research offers a practical approach to training LLMs for mathematical problem solving while reducing the dependence on extensive ground truth data collection.
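To make the surrogate-signal idea concrete, below is a minimal sketch (not the authors' released code) of what a format-length reward for a GRPO-style loop might look like. The <think>/\boxed{} template, the target length, the token proxy, and the weights are all illustrative assumptions rather than the paper's exact scheme; note that no ground truth answer is consulted anywhere.

```python
import re

# Hypothetical output template; the paper's exact format conventions may differ.
THINK_RE = re.compile(r"<think>.*?</think>", re.DOTALL)
BOXED_RE = re.compile(r"\\boxed\{[^{}]+\}")

def format_reward(completion: str) -> float:
    """1.0 if the completion follows the assumed template (a <think>...</think>
    reasoning block plus a \\boxed{} final answer), else 0.0."""
    has_think = THINK_RE.search(completion) is not None
    has_boxed = BOXED_RE.search(completion) is not None
    return 1.0 if (has_think and has_boxed) else 0.0

def length_reward(completion: str, target_len: int = 1024, scale: float = 1024.0) -> float:
    """A simple length-shaping term: highest near an assumed target budget,
    decaying linearly with distance. Whitespace split is a crude token proxy."""
    n = len(completion.split())
    return max(0.0, 1.0 - abs(n - target_len) / scale)

def surrogate_reward(completion: str, w_format: float = 1.0, w_length: float = 0.5) -> float:
    """Combined format-length surrogate reward, used in place of an
    answer-matching reward when computing GRPO group advantages."""
    return w_format * format_reward(completion) + w_length * length_reward(completion)
```

In a GRPO setting, this scalar would be computed for each sampled completion in a group and normalized within the group to form advantages, exactly as with an answer-based reward.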
Paper Type: Long
Research Area: Language Modeling
Research Area Keywords: Language Modeling
Contribution Types: NLP engineering experiment, Approaches to low-resource settings, Data analysis
Languages Studied: English
Submission Number: 2088