DUAL RM: Beyond Rule-based Preference Reward Modeling via Meta-Reward

ACL ARR 2026 January Submission 8019 Authors

06 Jan 2026 (modified: 20 Mar 2026) · ACL ARR 2026 January Submission · CC BY 4.0
Keywords: Reward Modeling, GRPO, Meta-Reward
Abstract: The ability to model sparse and underspecified rewards, characteristic of human preferences, is fundamental to scaling Reinforcement Learning (RL). Current preference-based reward modeling largely relies on verifiable rewards, where human-annotated labels define rule-based signals. However, these methods face a fundamental bottleneck we term the Matryoshka Doll Problem: a recursive dependency in which each reward verifier requires a meta-verifier, leading to continuous and costly dependence on human annotation. In this work, we propose Dual RM, which couples discriminative and generative reward models (DisRMs and GenRMs) under a non-parametric meta-reward. Rather than verifying the correctness of the GenRM's reasoning, the meta-reward evaluates its practical impact on response quality. Specifically, the GenRM identifies multi-dimensional evaluation rubrics and iteratively refines the response, while the DisRM quantifies the quality shift induced by each rubric. Furthermore, we implement rubric-based test-time scaling to improve sample efficiency and preference alignment under both DPO and GRPO. Our experiments demonstrate that Dual RM achieves strong performance across major preference benchmarks. Notably, even when trained exclusively on the language modality, it exhibits robust cross-modal transfer on Omni-RewardBench.
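The abstract's central loop (GenRM proposes rubrics and refines the response; DisRM measures the quality shift each rubric induces; the meta-reward aggregates those shifts) can be illustrated with a minimal sketch. All function names here (`gen_rm_propose_rubrics`, `gen_rm_refine`, `dis_rm_score`) and the toy scoring logic are hypothetical stand-ins, not the authors' actual models or API:

```python
# Hypothetical sketch of the Dual RM meta-reward loop described in the
# abstract. The stand-in DisRM/GenRM functions below are illustrative
# only; in the paper these are learned reward models.

def dis_rm_score(response: str) -> float:
    # Stand-in DisRM: assigns a scalar quality score to a response
    # (here a toy word-count proxy, purely for demonstration).
    return float(len(response.split()))

def gen_rm_propose_rubrics(response: str) -> list[str]:
    # Stand-in GenRM: identifies multi-dimensional evaluation rubrics.
    return ["clarity", "completeness"]

def gen_rm_refine(response: str, rubric: str) -> str:
    # Stand-in GenRM: iteratively refines the response along one rubric.
    return response + f" [refined for {rubric}]"

def meta_reward(response: str) -> float:
    # Non-parametric meta-reward: instead of verifying the GenRM's
    # reasoning, sum the DisRM quality shifts that each rubric-guided
    # refinement actually induces.
    total_shift = 0.0
    current = response
    for rubric in gen_rm_propose_rubrics(response):
        refined = gen_rm_refine(current, rubric)
        total_shift += dis_rm_score(refined) - dis_rm_score(current)
        current = refined
    return total_shift

print(meta_reward("A short answer."))
```

The key design point the sketch captures is that the GenRM is rewarded only for rubrics whose refinements measurably improve the response under the DisRM, which removes the need for a human meta-verifier of the GenRM's reasoning.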
Paper Type: Long
Research Area: Safety and Alignment in LLMs
Research Area Keywords: Generative Reward Modeling, Discriminative Reward Modeling, Rubric-based Reward
Contribution Types: Model analysis & interpretability, NLP engineering experiment
Languages Studied: English
Submission Number: 8019