Keywords: Reward Modeling; Scalable; Multi-modal
Abstract: Scaling learning from sparse and underspecified rewards is a fundamental challenge for the continuous improvement of foundation models in decision-making.
Reward modeling and policy optimization have traditionally been decoupled, which often causes policies to overfit static reward models and thus limits scalability and generalization.
In this work, we propose a \textbf{meta-reward learning} algorithm that couples discriminative and generative reward models with policy models, producing \textbf{\textit{scalable intrinsic rewards}} that bridge the gap between sparse environmental rewards and the dynamics of policy learning.
The goal of meta-reward learning is to train a reward model capable of generalizing effectively across diverse scenarios under limited supervision, such as handling unseen modalities or tasks.
In particular, our dual-reward design attributes each scalar reward to multiple underlying language criteria and iteratively refines their priorities, thereby enabling continuous improvement of both the policy and the reward model.
We implement \textsc{GRM-Omni}, an omni-modal reward model that not only achieves strong results on multiple multi-modal preference reward benchmarks but also facilitates more effective policy decisions.
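To make the dual-reward idea concrete, here is a minimal sketch of how a scalable intrinsic reward could blend a discriminative reward-model score, a generative reward model's per-criterion scores, and the sparse environmental reward. The function name `scalable_intrinsic_reward`, the softmax over criterion priorities, and the fixed 0.5 blending coefficients are illustrative assumptions for this sketch, not the paper's actual formulation.

```python
# Hypothetical sketch: names and weighting scheme are illustrative assumptions.
import torch

def scalable_intrinsic_reward(
    env_reward: torch.Tensor,         # sparse environmental reward, shape [B]
    disc_score: torch.Tensor,         # discriminative RM scalar score, shape [B]
    criterion_scores: torch.Tensor,   # generative RM per-criterion scores, shape [B, C]
    criterion_priority: torch.Tensor, # learnable priority over C language criteria, shape [C]
) -> torch.Tensor:
    """Blend sparse environment reward with dense reward-model signals."""
    # Normalize priorities so the per-criterion attribution sums to one.
    w = torch.softmax(criterion_priority, dim=-1)
    # Attribute the generative reward to the weighted language criteria.
    gen_reward = criterion_scores @ w
    # Combine the sparse extrinsic term with the dense intrinsic terms.
    return env_reward + 0.5 * disc_score + 0.5 * gen_reward

# Example usage with random stand-in scores for a batch of 4 trajectories
# and 3 language criteria.
r = scalable_intrinsic_reward(
    env_reward=torch.tensor([0.0, 0.0, 1.0, 0.0]),
    disc_score=torch.randn(4),
    criterion_scores=torch.randn(4, 3),
    criterion_priority=torch.zeros(3),
)
print(r.shape)  # torch.Size([4])
```

In this sketch, the priority vector plays the role of the iteratively refined criterion weights described in the abstract; in practice it would be updated jointly with the policy rather than held fixed.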
Supplementary Material: zip
Primary Area: generative models
Submission Number: 11355