Probing Preference Representations: A Multi-Dimensional Evaluation and Analysis Method for Reward Models

ACL ARR 2025 February Submission 7491 Authors

16 Feb 2025 (modified: 09 May 2025) · ACL ARR 2025 February Submission · CC BY 4.0
Abstract:

Previous methods evaluate reward models on a fixed pairwise ranking test set, but they typically provide no performance information on individual preference dimensions. In this work, we address the evaluation challenge of reward models by probing their preference representations. To validate this evaluation approach, we construct a Multi-dimensional Reward Model Benchmark (MRMBench), a collection of six probing tasks covering different preference dimensions, designed to favor reward models that better capture preferences across those dimensions. Furthermore, building on MRMBench, we introduce an analysis method, inference-time probing, that improves the interpretability of reward predictions. Through extensive experiments, we find that reward models effectively capture preferences across different dimensions after being trained on preference data. Moreover, the results show that MRMBench correlates strongly with LLM alignment performance, supporting it as a reliable reference for developing advanced reward models.
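The paper itself is not reproduced on this page, but the abstract describes the standard probing recipe: train a lightweight classifier on a reward model's frozen hidden representations and measure how well each preference dimension can be recovered. A minimal sketch of that idea, assuming a linear probe over synthetic stand-in features (the actual MRMBench tasks, data, and extraction layer are not specified here, and the dimensions and labels below are hypothetical):

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Stand-in for hidden representations extracted from a trained reward model
# (e.g., its penultimate layer, one vector per response). Here we draw
# synthetic 768-d vectors with a weak planted signal so the probe has
# something to find.
n_samples, hidden_dim = 2000, 768
X = rng.normal(size=(n_samples, hidden_dim))
# Hypothetical binary label for one preference dimension (e.g., helpfulness).
y = (X[:, :8].sum(axis=1) + 0.5 * rng.normal(size=n_samples) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# A linear probe: if a simple classifier recovers the preference label from
# frozen representations, the reward model plausibly encodes that dimension.
probe = LogisticRegression(max_iter=1000)
probe.fit(X_train, y_train)
print(f"held-out probe accuracy: {probe.score(X_test, y_test):.3f}")

In this framing, held-out probe accuracy on each of the six tasks serves as the per-dimension score, and "inference-time probing" would apply such probes to representations computed while the reward model scores new responses; the exact procedure is defined in the paper.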

Paper Type: Long
Research Area: Interpretability and Analysis of Models for NLP
Research Area Keywords: Reward Model, Probing, Preference Representation, Inference-time Probing
Contribution Types: Model analysis & interpretability
Languages Studied: English
Submission Number: 7491