Keywords: Reward Modeling, GRPO, Meta-Reward
Abstract: The ability to model \textit{sparse and underspecified rewards}, characteristic of human preferences, is fundamental to scaling Reinforcement Learning (RL).
Current preference-based reward modeling largely relies on verifiable rewards, where human-annotated labels define rule-based signals.
However, these methods face a fundamental bottleneck we term the \textbf{\textit{Matryoshka Doll Problem}}: \textit{a recursive dependency where each reward verifier requires a meta-verifier}, leading to continuous and costly dependence on human annotation.
In this work, we propose \textbf{\textsc{\textcolor{dualrm-name}{Dual RM}}}, which couples discriminative and generative reward models (DisRMs and GenRMs) under a non-parametric \textcolor{dualrm-green}{meta-reward}.
Rather than verifying the correctness of GenRM’s reasoning, the meta-reward evaluates the practical impact of that reasoning on response quality.
Specifically, GenRM identifies multi-dimensional evaluation rubrics and iteratively refines the response, while DisRM quantifies the quality shifts induced by each rubric.
Furthermore, we implement rubric-based test-time scaling to improve sample efficiency and preference alignment under both DPO and GRPO.
Our experiments demonstrate that \textbf{\textsc{\textcolor{dualrm-name}{Dual RM}}} achieves strong performance across major preference benchmarks.
Notably, even when trained exclusively on language modality, it exhibits robust cross-modal transfer on Omni-RewardBench.
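For illustration only, below is a minimal Python sketch of one plausible reading of the abstract's meta-reward: each GenRM rubric is credited with the DisRM score shift it induces after rubric-guided refinement of the response. The helper names (genrm_propose_rubrics, genrm_refine, disrm_score) are hypothetical placeholders standing in for the two reward models, not the paper's actual interface.

```python
from typing import Callable, List, Tuple

def dual_rm_meta_rewards(
    prompt: str,
    response: str,
    genrm_propose_rubrics: Callable[[str, str], List[str]],  # hypothetical: GenRM proposes evaluation rubrics
    genrm_refine: Callable[[str, str, str], str],             # hypothetical: GenRM refines the response under one rubric
    disrm_score: Callable[[str, str], float],                 # hypothetical: DisRM returns a scalar quality score
) -> List[Tuple[str, float]]:
    """Sketch of a non-parametric meta-reward: score each GenRM rubric by the
    quality shift (DisRM score delta) its refinement induces, rather than by
    verifying the correctness of GenRM's reasoning."""
    rubrics = genrm_propose_rubrics(prompt, response)
    current = response
    meta_rewards: List[Tuple[str, float]] = []
    for rubric in rubrics:
        refined = genrm_refine(prompt, current, rubric)       # iterative, rubric-guided refinement
        delta = disrm_score(prompt, refined) - disrm_score(prompt, current)
        meta_rewards.append((rubric, delta))                  # meta-reward = practical impact on response quality
        current = refined
    return meta_rewards
```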
Paper Type: Long
Research Area: Safety and Alignment in LLMs
Research Area Keywords: Generative Reward Modeling, Discriminative Reward Modeling, Rubric-based Reward
Contribution Types: Model analysis & interpretability, NLP engineering experiment
Languages Studied: English
Submission Number: 8019