Surrogate Signals from Format and Length: Reinforcement Learning for Solving Mathematical Problems without Ground Truth Answers
Abstract: Large Language Models (LLMs) have achieved remarkable success in natural language processing tasks, with Reinforcement Learning (RL) playing a key role in adapting them to specific applications. However, obtaining ground truth answers for training LLMs in mathematical problem solving is often challenging, costly, and sometimes infeasible. This research investigates the use of format and length as surrogate signals for training LLMs on mathematical problems, bypassing the need for traditional ground truth answers.
Our study shows that a reward function centered on format correctness alone can yield performance improvements comparable to the standard Group Relative Policy Optimization (GRPO) algorithm in the early phase of training. Recognizing the limitations of format-only rewards, we incorporate length-based rewards. The resulting GRPO approach, leveraging format-length surrogate signals, not only matches but in certain scenarios surpasses the standard GRPO algorithm that relies on ground truth answers, achieving 40.0% accuracy on AIME2024 with a 7B base model.
Through systematic exploration and experimentation, this research offers a practical approach to training LLMs for mathematical problem solving while reducing the dependence on extensive ground truth data collection.
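To make the surrogate-signal idea concrete, below is a minimal sketch (not the authors' released code) of what a format-length reward for a GRPO-style loop might look like. The <think>/\boxed{} template, the target length, the token proxy, and the weights are all illustrative assumptions rather than the paper's exact scheme; note that no ground truth answer is consulted anywhere.

```python
import re

# Hypothetical output template; the paper's exact format conventions may differ.
THINK_RE = re.compile(r"<think>.*?</think>", re.DOTALL)
BOXED_RE = re.compile(r"\\boxed\{[^{}]+\}")

def format_reward(completion: str) -> float:
    """1.0 if the completion follows the assumed template (a <think>...</think>
    reasoning block plus a \\boxed{} final answer), else 0.0."""
    has_think = THINK_RE.search(completion) is not None
    has_boxed = BOXED_RE.search(completion) is not None
    return 1.0 if (has_think and has_boxed) else 0.0

def length_reward(completion: str, target_len: int = 1024, scale: float = 1024.0) -> float:
    """A simple length-shaping term: highest near an assumed target budget,
    decaying linearly with distance. Whitespace split is a crude token proxy."""
    n = len(completion.split())
    return max(0.0, 1.0 - abs(n - target_len) / scale)

def surrogate_reward(completion: str, w_format: float = 1.0, w_length: float = 0.5) -> float:
    """Combined format-length surrogate reward, used in place of an
    answer-matching reward when computing GRPO group advantages."""
    return w_format * format_reward(completion) + w_length * length_reward(completion)
```

In a GRPO setting, this scalar would be computed for each sampled completion in a group and normalized within the group to form advantages, exactly as with an answer-based reward.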
Paper Type: Long
Research Area: Language Modeling
Research Area Keywords: Language Modeling
Contribution Types: NLP engineering experiment, Approaches to low-resource settings, Data analysis
Languages Studied: English
Submission Number: 2088