On Designing Effective RL Reward at Training Time for LLM Reasoning

27 Sept 2024 (modified: 05 Feb 2025) · Submitted to ICLR 2025 · CC BY 4.0
Keywords: Large Language Models, RLHF, PPO, LLM for Reasoning, Reward Design
TL;DR: This paper explores reward function design for RL training to improve LLM reasoning, addressing reward hacking and improving a diverse set of 1.5B and 7B LLMs on mathematical reasoning benchmarks.
Abstract: Reward models have become increasingly critical for improving the reasoning capability of LLMs. Existing research has shown that a well-trained reward model can substantially improve model performance *at inference time* via search or best-of-N voting. However, the potential of reward models during *RL training time* remains largely under-explored. It is currently unclear whether these reward models can provide additional training signals beyond the sparse success rewards that verify the correctness of solutions. In this work, we evaluate popular reward models for RL training, including the Outcome-supervised Reward Model (ORM) and the Process-supervised Reward Model (PRM), and train a collection of LLMs for math problems using RL by combining these learned rewards with success rewards. Surprisingly, even though these learned reward models have strong inference-time performance, they may bring only marginal improvements or even hurt RL *training*, producing worse performance than LLMs trained with the success reward alone. We find that *training collapse* easily occurs when the PRM simply serves as a shaping reward on top of the success reward. Our further analysis reveals two issues that may lead to this sub-optimal performance. We therefore introduce two novel reward refinement techniques, the **Clip** and **Delta** mechanisms, to tackle the identified issues. We evaluate our techniques with multiple reward models over a set of 1.5B and 7B LLMs on the MATH and GSM8K benchmarks, where both **Clip** and **Delta** consistently enhance RL training. Finally, we also demonstrate that with a carefully designed reward function, pure RL training without any additional supervised fine-tuning can further improve all the evaluated LLMs, including the state-of-the-art 7B LLM Qwen2.5-Math-7B-Instruct, on the MATH and GSM8K benchmarks.
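
To make the reward-shaping idea concrete, below is a minimal Python sketch of how per-step process rewards from a PRM might be combined with a sparse success reward. The abstract does not give the exact **Clip** and **Delta** formulas, so the `shaped_rewards` helper, the threshold `eta`, and the specific refinement rules used here (capping each step reward at a threshold; differencing consecutive step rewards) are illustrative assumptions rather than the paper's definitive implementation.

```python
from typing import List

def shaped_rewards(
    prm_step_rewards: List[float],  # hypothetical per-step scores from a PRM
    success: bool,                  # sparse success reward: is the final answer correct?
    eta: float = 0.0,               # assumed Clip threshold (hyperparameter)
    use_delta: bool = True,
) -> List[float]:
    """Return one shaped reward per reasoning step (Clip/Delta-style sketch).

    Assumptions (not taken from the abstract):
      * Clip caps each process reward at `eta`, so the PRM can only penalize
        low-quality steps rather than reward padding the solution with steps.
      * Delta replaces each step reward with the difference to the next step's
        reward, so accumulating many mediocre steps yields no net gain.
    """
    if not prm_step_rewards:
        return []

    # Clip: r_t <- min(r_t, eta)
    clipped = [min(r, eta) for r in prm_step_rewards]

    if use_delta:
        # Delta: r_t <- r_t - r_{t+1}; the summed process reward then
        # telescopes to the first step's clipped score, removing any
        # incentive to inflate the number of steps.
        shaped = [clipped[t] - clipped[t + 1] for t in range(len(clipped) - 1)]
        shaped.append(clipped[-1])
    else:
        shaped = clipped

    # The sparse success reward is added only at the final step.
    shaped[-1] += 1.0 if success else 0.0
    return shaped


if __name__ == "__main__":
    # Toy example: three reasoning steps scored by a PRM, correct final answer.
    print(shaped_rewards([0.8, 0.3, -0.2], success=True, eta=0.0))
```

Under these assumptions, the shaped process reward can only discourage bad steps and cannot be accumulated by generating more steps, which is one way to read the abstract's goal of avoiding reward hacking while still benefiting from the learned reward model.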
Primary Area: foundation or frontier models, including LLMs
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 11435