Dynamic Evaluation of Reward Models via Pairwise Maximum Discrepancy Competition

18 Sept 2025 (modified: 23 Nov 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: Reward Models, LLMs, Dynamic Evaluation, Maximum Discrepancy Competition
TL;DR: We propose a dynamic and cost-efficient framework for evaluating reward models via pairwise maximum discrepancy competition.
Abstract: Reward models (RMs) are essential for aligning large language models (LLMs) with human preferences, making their rigorous and comprehensive evaluation a critical task. However, traditional evaluation methods rely heavily on closed datasets with pre-annotated preference pairs, which often fail to assess how well RMs generalize to unseen prompts in open-world scenarios. To overcome these limitations, we introduce the Pairwise Maximum Discrepancy Competition (PMDC) framework, a dynamic and annotation-efficient evaluation approach that adaptively selects informative test cases from a large, unlabeled, open-domain prompt pool. Specifically, PMDC first identifies input pairs that elicit significantly divergent preference scores from two RMs. These discriminative pairs are then judged by an advanced LLM acting as an oracle, which determines which RM's judgments align more closely with human preferences. The resulting pairwise comparisons are aggregated via the Bradley-Terry model, yielding a global ranking of the assessed RMs. We apply PMDC to re-evaluate 10 representative RMs from the RewardBench collection. The results reveal noticeable inconsistencies between the resulting RM rankings and those derived from conventional benchmarks. Further analysis uncovers the strengths and weaknesses of each model, providing valuable insights for future improvements in reward modeling.
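To make the pipeline concrete, below is a minimal Python sketch of the two computational steps the abstract names: maximum-discrepancy pair selection and Bradley-Terry aggregation. It is an illustration under stated assumptions, not the paper's implementation: the discrepancy measure (absolute difference of the two RMs' score margins), the `rm(prompt, response) -> score` interface, and all identifiers are hypothetical, and the LLM-oracle judging step is abstracted into a win-count matrix.

```python
import numpy as np

def select_max_discrepancy_pairs(pool, rm_a, rm_b, k):
    """Pick the k test cases on which two RMs disagree most.

    `pool` is a list of (prompt, response_1, response_2) triples drawn from
    an unlabeled, open-domain prompt pool. Discrepancy is measured here as
    the absolute difference of the two RMs' score margins -- one plausible
    reading of "significantly divergent preference scores".
    """
    gaps = []
    for prompt, r1, r2 in pool:
        margin_a = rm_a(prompt, r1) - rm_a(prompt, r2)  # RM A's preference margin
        margin_b = rm_b(prompt, r1) - rm_b(prompt, r2)  # RM B's preference margin
        gaps.append(abs(margin_a - margin_b))
    top = np.argsort(gaps)[::-1][:k]  # indices of the k largest gaps
    return [pool[i] for i in top]

def fit_bradley_terry(wins, iters=500):
    """Aggregate pairwise RM-vs-RM outcomes into global strengths.

    `wins[i, j]` counts comparisons in which the oracle judged RM i's
    preference closer to human preference than RM j's. Uses the standard
    MM (minorize-maximize) update for the Bradley-Terry model.
    """
    n = wins.shape[0]
    theta = np.ones(n)
    for _ in range(iters):
        for i in range(n):
            total_wins = wins[i].sum() - wins[i, i]
            denom = sum((wins[i, j] + wins[j, i]) / (theta[i] + theta[j])
                        for j in range(n) if j != i)
            if denom > 0:
                theta[i] = total_wins / denom
        theta /= theta.sum()  # fix the scale; BT is identifiable only up to scaling
    return theta  # higher theta -> higher rank in the competition

# Toy usage: 3 RMs with a synthetic win matrix from oracle judgments.
wins = np.array([[0, 8, 9],
                 [2, 0, 6],
                 [1, 4, 0]], dtype=float)
print(fit_bradley_terry(wins))  # RM 0 comes out strongest here
```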
Primary Area: datasets and benchmarks
Submission Number: 11095