GRM-Omni: Generative Omni-modality Reward Modeling via Meta Reward Learning

18 Sept 2025 (modified: 07 Jan 2026) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: Reward Modeling; Scalable; Multi-modal
Abstract: Scaling from sparse and underspecified rewards is a fundamental challenge for the continuous improvement of foundation models in decision-making. Reward modeling and policy optimization have traditionally been decoupled, which often results in policies overfitting to static reward models and thus limits scalability and generalization. In this work, we propose a \textbf{meta-reward learning} algorithm that couples discriminative and generative reward models with policy models, producing \textbf{\textit{scalable intrinsic rewards}} that bridge the gap between sparse environmental rewards and the dynamics of policy learning. The goal of meta-reward learning is to train a reward model that generalizes effectively across diverse scenarios under limited supervision, such as unseen modalities or tasks. In particular, our dual-reward design attributes each scalar reward to multiple underlying language criteria and iteratively refines their priorities, thereby enabling continuous improvement of both the policy and the reward model. We implement \textsc{GRM-Omni}, an omni-modal reward model that not only achieves strong results on multiple multi-modal preference reward benchmarks but also facilitates more effective policy decisions.
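The sketch below illustrates one way the dual-reward idea from the abstract could be realized: a scalar intrinsic reward is attributed to several language criteria, blended with a sparse environment reward, and the criterion priorities are refined toward criteria that agree with the environment signal. All names (`Criterion`, `MetaReward`) and the update rule are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of the dual-reward design described in the abstract.
# Assumptions: per-criterion scores in [-1, 1] come from a generative judge;
# the priority update rule is a simple agreement-based step, chosen for clarity.
from dataclasses import dataclass, field


@dataclass
class Criterion:
    name: str       # natural-language criterion, e.g. "visual grounding"
    weight: float   # current priority of this criterion


@dataclass
class MetaReward:
    criteria: list = field(default_factory=list)
    lr: float = 0.1  # step size for priority refinement

    def intrinsic_reward(self, scores: dict) -> float:
        """Combine per-criterion scores into one scalar intrinsic reward."""
        total_w = sum(c.weight for c in self.criteria)
        return sum(c.weight * scores[c.name] for c in self.criteria) / total_w

    def refine(self, scores: dict, env_reward: float) -> None:
        """Raise the priority of criteria whose scores agree with the sparse environment reward."""
        for c in self.criteria:
            agreement = scores[c.name] * env_reward  # > 0 when criterion and env reward agree in sign
            c.weight = max(1e-3, c.weight + self.lr * agreement)


# Usage: blend the intrinsic reward with a sparse environment reward, then refine priorities.
rm = MetaReward(criteria=[Criterion("helpfulness", 1.0), Criterion("visual grounding", 1.0)])
scores = {"helpfulness": 0.8, "visual grounding": -0.2}       # per-criterion judgments
blended = 0.5 * rm.intrinsic_reward(scores) + 0.5 * 1.0       # env_reward = 1.0 (sparse success)
rm.refine(scores, env_reward=1.0)
```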
Supplementary Material: zip
Primary Area: generative models
Submission Number: 11355