GRM-Omni: Generative Omni-modality Reward Modeling via Meta Reward Learning

18 Sept 2025 (modified: 07 Jan 2026) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: Reward Modeling; Scalable; Multi-modal
Abstract: Scaling from sparse and underspecified rewards is a fundamental challenge for the continuous improvement of foundation models in decision-making. Reward modeling and policy optimization have traditionally been decoupled, which often results in policies overfitting to static reward models and thus limits scalability and generalization. In this work, we propose a \textbf{meta-reward learning} algorithm that couples discriminative and generative reward models with policy models, producing \textbf{\textit{scalable intrinsic rewards}} that bridge the gap between sparse environmental rewards and the dynamics of policy learning. The goal of meta-reward learning is to train a reward model that generalizes effectively across diverse scenarios under limited supervision, such as unseen modalities or tasks. In particular, our dual-reward design attributes each scalar reward to multiple underlying language criteria and iteratively refines their priorities, thereby enabling continuous improvement of both the policy and the reward model. We implement \textsc{GRM-Omni}, an omni-modal reward model that not only achieves strong results on multiple multi-modal preference reward benchmarks but also facilitates more effective policy decisions.
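The sketch below illustrates one way the dual-reward idea from the abstract could be realized: a scalar intrinsic reward is attributed to several language criteria, blended with a sparse environment reward, and the criterion priorities are refined toward criteria that agree with the environment signal. All names (`Criterion`, `MetaReward`) and the update rule are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of the dual-reward design described in the abstract.
# Assumptions: per-criterion scores in [-1, 1] come from a generative judge;
# the priority update rule is a simple agreement-based step, chosen for clarity.
from dataclasses import dataclass, field


@dataclass
class Criterion:
    name: str       # natural-language criterion, e.g. "visual grounding"
    weight: float   # current priority of this criterion


@dataclass
class MetaReward:
    criteria: list = field(default_factory=list)
    lr: float = 0.1  # step size for priority refinement

    def intrinsic_reward(self, scores: dict) -> float:
        """Combine per-criterion scores into one scalar intrinsic reward."""
        total_w = sum(c.weight for c in self.criteria)
        return sum(c.weight * scores[c.name] for c in self.criteria) / total_w

    def refine(self, scores: dict, env_reward: float) -> None:
        """Raise the priority of criteria whose scores agree with the sparse environment reward."""
        for c in self.criteria:
            agreement = scores[c.name] * env_reward  # > 0 when criterion and env reward agree in sign
            c.weight = max(1e-3, c.weight + self.lr * agreement)


# Usage: blend the intrinsic reward with a sparse environment reward, then refine priorities.
rm = MetaReward(criteria=[Criterion("helpfulness", 1.0), Criterion("visual grounding", 1.0)])
scores = {"helpfulness": 0.8, "visual grounding": -0.2}       # per-criterion judgments
blended = 0.5 * rm.intrinsic_reward(scores) + 0.5 * 1.0       # env_reward = 1.0 (sparse success)
rm.refine(scores, env_reward=1.0)
```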
Supplementary Material: zip
Primary Area: generative models
Submission Number: 11355