Keywords: Reward Model, LLM Evaluation, Benchmark, Reward Variance
TL;DR: RVB shows that variance properties of RM scores (concentration, separation, stability) relate to RLHF convergence; clear, separable scores matter beyond accuracy.
Abstract: Reward models (RMs) provide the core signal in reinforcement learning from human feedback (RLHF). However, most evaluations focus on pairwise accuracy and overlook how the separability and concentration of reward distributions shape the optimization landscape and convergence rate. To address this limitation, we present the Reward Variance Benchmark (RVB), an evaluation suite that quantifies distributional properties of RM scores. RVB introduces three variance-oriented metrics that capture complementary aspects of an RM’s signal: score distribution concentration, global pairwise separation, and cross-prompt decision-style stability. We evaluate 23 widely used RMs with a toolkit that supports reproducible analysis. On this benchmark, the variance metrics yield stable rankings and, together with accuracy, show preliminary links to downstream convergence behavior in a small-scale RLHF case study. These findings support a key insight: in addition to judging responses correctly, an effective RM in RLHF should also score them with sufficient clarity and stability to provide actionable gradients. Overall, RVB provides a first variance-centric predictive benchmark for analyzing RLHF convergence under a fixed setup, while also supporting more open-ended studies of how reward variance interacts with calibration and policy dependence.
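The abstract names the three variance-oriented metrics but does not give their formulas. As a rough illustration only, the sketch below shows one plausible way such quantities could be computed from raw RM scores; the function names, inputs, and formulas are assumptions for exposition, not RVB's actual definitions.

```python
import numpy as np

def concentration(scores_per_prompt):
    """Illustrative within-prompt concentration: mean standard deviation of
    candidate scores per prompt (lower = more concentrated). Assumed metric,
    not RVB's definition. scores_per_prompt: list of 1-D arrays of RM scores."""
    return float(np.mean([np.std(s) for s in scores_per_prompt]))

def pairwise_separation(chosen_scores, rejected_scores):
    """Illustrative global separation: mean score margin between preferred
    and dispreferred responses across all pairs."""
    return float(np.mean(np.asarray(chosen_scores) - np.asarray(rejected_scores)))

def decision_stability(margins_by_prompt):
    """Illustrative cross-prompt stability: mean per-prompt margin divided by
    its spread across prompts (higher = more consistent decision style).
    margins_by_prompt: list of 1-D arrays of chosen-minus-rejected margins."""
    per_prompt = np.array([np.mean(m) for m in margins_by_prompt])
    return float(abs(per_prompt.mean()) / (per_prompt.std() + 1e-8))

# Toy usage with synthetic scores for two prompts:
scores = [np.array([1.2, 0.4, 0.3]), np.array([0.9, 0.8, -0.1])]
print(concentration(scores))
print(pairwise_separation([1.2, 0.9], [0.4, 0.8]))
print(decision_stability([np.array([0.8, 0.9]), np.array([0.1, 0.2])]))
```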
Supplementary Material: zip
Primary Area: datasets and benchmarks
Submission Number: 6065