Previous methods evaluate reward models by testing them on a fixed pairwise ranking test set, but they typically do not report performance on individual preference dimensions. In this work, we address the challenge of reward model evaluation by probing preference representations. To confirm the effectiveness of this evaluation, we construct the Multi-dimensional Reward Model Benchmark (MRMBench), a collection of six probing tasks covering different preference dimensions. We design it to favor and encourage reward models that better capture preferences across these dimensions. Furthermore, based on MRMBench, we introduce an analysis method, inference-time probing, that improves the interpretability of reward predictions. Through extensive experiments, we find that reward models can effectively capture preferences across different dimensions after being trained on preference data. Moreover, the results show that MRMBench correlates strongly with LLM alignment performance, supporting it as a reliable reference for developing advanced reward models.
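To make the idea of probing preference representations concrete, the sketch below trains a simple linear probe on frozen hidden states of a reward model's backbone to predict a single preference dimension. This is a minimal illustration under assumptions: the checkpoint name, mean-pooling choice, and binary-label setup are placeholders, not the exact MRMBench protocol.

```python
# Minimal sketch: linear probing of a reward model's hidden representations
# for one preference dimension. Model name and probing setup are assumptions.
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"

# Hypothetical backbone of a trained reward model; replace with your checkpoint.
tokenizer = AutoTokenizer.from_pretrained("my-org/reward-model-backbone")
encoder = AutoModel.from_pretrained("my-org/reward-model-backbone").to(device).eval()

class LinearProbe(nn.Module):
    """Linear classifier trained on frozen hidden states to predict
    labels for a single preference dimension (e.g., helpfulness)."""
    def __init__(self, hidden_size: int, num_labels: int):
        super().__init__()
        self.classifier = nn.Linear(hidden_size, num_labels)

    def forward(self, pooled_hidden: torch.Tensor) -> torch.Tensor:
        return self.classifier(pooled_hidden)

@torch.no_grad()
def pool_hidden_states(texts: list[str]) -> torch.Tensor:
    """Mean-pool the final-layer hidden states of the frozen backbone."""
    batch = tokenizer(texts, padding=True, truncation=True,
                      return_tensors="pt").to(device)
    hidden = encoder(**batch).last_hidden_state          # (B, T, H)
    mask = batch["attention_mask"].unsqueeze(-1)          # (B, T, 1)
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)   # (B, H)

probe = LinearProbe(encoder.config.hidden_size, num_labels=2).to(device)
optimizer = torch.optim.AdamW(probe.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

def train_step(texts: list[str], labels: list[int]) -> float:
    """One optimization step; only the probe's parameters are updated."""
    features = pool_hidden_states(texts)                  # frozen features
    logits = probe(features)
    loss = loss_fn(logits, torch.tensor(labels, device=device))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

In this framing, probe accuracy on held-out examples serves as a proxy for how well the reward model's representations encode that preference dimension; the same recipe can be repeated per dimension to obtain a multi-dimensional profile.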