Evaluating Robustness of Reward Models for Mathematical Reasoning

27 Sept 2024 (modified: 05 Feb 2025) · Submitted to ICLR 2025 · CC BY 4.0
Keywords: mathematical reasoning, RLHF, reward models, reward overoptimization, language models, benchmark
TL;DR: We propose a design for a reliable benchmark for reward models and validate our design using the results of optimized policies and through the lens of reward overoptimization.
Abstract: Reward models are key to reinforcement learning from human feedback (RLHF) systems, aligning model behavior with human preferences. In the math domain in particular, many studies have used reward models to align policies and improve reasoning capabilities. Recently, as the importance of reward models has grown, RewardBench was proposed to better understand their behavior. However, we find that the math subset of RewardBench uses different representations for chosen and rejected completions and relies on a single comparison, which may yield unreliable results because it considers only an isolated case. Consequently, it fails to accurately reflect the robustness of reward models, leading to a misunderstanding of their performance and potentially resulting in reward hacking. In this work, we propose a direction for designing benchmarks that reliably evaluate reward models for mathematical reasoning. We conduct comprehensive analyses to validate whether our design effectively reflects the robustness of reward models. The results underscore that a benchmark designed to reduce the possibility of reward hacking and to employ one-to-many comparisons correlates strongly with the results of the optimized policies, whereas the existing benchmark shows almost no correlation. Furthermore, analyzing through the lens of reward overoptimization, we show that a design involving multiple comparisons yields a significantly more reliable benchmark. We make our code and data publicly available.
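To illustrate what a one-to-many comparison means for reward-model evaluation (as opposed to the single chosen-vs-rejected pair used in RewardBench), here is a minimal sketch. It is not the authors' released code; the function and field names are hypothetical, and the key assumption is that an example counts as correct only when the chosen completion outscores every rejected completion.

```python
from typing import Callable, Dict, List


def one_to_many_accuracy(
    reward_fn: Callable[[str, str], float],  # reward_fn(prompt, completion) -> scalar reward
    examples: List[Dict],                    # each: {"prompt": str, "chosen": str, "rejected": List[str]}
) -> float:
    """Score a reward model with one-to-many comparisons.

    For each example, the chosen completion must receive a higher reward
    than *all* rejected completions, not just a single sampled one.
    """
    correct = 0
    for ex in examples:
        chosen_score = reward_fn(ex["prompt"], ex["chosen"])
        rejected_scores = [reward_fn(ex["prompt"], r) for r in ex["rejected"]]
        if all(chosen_score > s for s in rejected_scores):
            correct += 1
    return correct / len(examples) if examples else 0.0
```

Under this stricter criterion, a reward model cannot pass an example by happening to beat one easy rejected completion, which is the intuition behind the paper's claim that multiple comparisons better reflect robustness.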
Supplementary Material: zip
Primary Area: datasets and benchmarks
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 10281