GRAM: A Generative Foundation Reward Model for Reward Generalization

Published: 01 May 2025 · Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY 4.0
TL;DR: This study introduces a generative foundation reward model (GRAM), which pre-learns preferences through a two-stage training process and selective label smoothing.
Abstract: Reward models play an important role in aligning large language models (LLMs), but they are typically trained as discriminative models and rely only on labeled human preference data. In this paper, we explore methods that train reward models using both unlabeled and labeled data. Building on the generative models underlying LLMs, we develop a generative reward model that is first trained via large-scale unsupervised learning and then fine-tuned via supervised learning. We also show that by using label smoothing, we are in fact optimizing a regularized pairwise ranking loss. This result, in turn, provides a new view of training reward models, linking generative and discriminative models under the same class of training objectives. The outcome of these techniques is a foundation reward model, which can be applied to a wide range of tasks with little or no further fine-tuning effort. Extensive experiments show that this model generalizes well across several tasks, including response ranking, reinforcement learning from human feedback, and task adaptation with fine-tuning, achieving significant performance improvements over several strong baseline models.
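To make the label-smoothing claim concrete, here is a brief sketch in standard reward-modeling notation (the notation is illustrative; the paper's exact formulation may differ). A discriminative reward model $r_\theta$ is commonly trained with the pairwise (Bradley-Terry) ranking loss on a preferred response $y_w$ and a rejected response $y_l$ for a prompt $x$:

\[ \mathcal{L}(\theta) = -\log \sigma(\Delta), \qquad \Delta = r_\theta(x, y_w) - r_\theta(x, y_l). \]

Applying label smoothing with coefficient $\epsilon \in [0, 1)$ to the pairwise preference labels gives

\[ \mathcal{L}_{\mathrm{LS}}(\theta) = -(1-\epsilon)\log \sigma(\Delta) - \epsilon \log \sigma(-\Delta), \]

which is the original ranking loss plus a term that grows with the reward margin $\Delta$, i.e., a regularizer that discourages overconfident margins. This is the sense in which label smoothing can be read as optimizing a regularized pairwise ranking loss.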
Lay Summary: Reward models are essential for aligning large language models (LLMs) with human preferences. Traditionally, these models are trained using labeled data, which can limit their potential. In this study, we propose a new method that combines both labeled and unlabeled data for training reward models. We introduce a generative reward model that first learns from a large amount of unlabeled data and is then fine-tuned with supervised data. Additionally, we demonstrate that using label smoothing during training improves performance by optimizing a regularized ranking loss. This approach bridges generative and discriminative models, offering a new perspective on training reward models. Our model can be easily applied to various tasks without the need for extensive fine-tuning. This means that when aligning LLMs, there is no longer a need to train a reward model from scratch with large amounts of task-specific labeled data. Instead, you can directly apply our reward model or adapt it to align your LLM.
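As an illustration of how a foundation reward model of this kind might be applied off the shelf, below is a minimal sketch in the common "generative judge" style: it loads a generative model with Hugging Face Transformers and asks it to pick the better of two responses. The checkpoint name, prompt template, and answer-token scheme are assumptions for illustration, not the paper's released interface.

```python
# Minimal sketch of scoring a response pair with a *generative* reward model.
# The checkpoint name and prompt template are placeholders (hypothetical), and
# the pattern shown is the generic "pick A or B" judging setup, not necessarily
# the exact format used by GRAM.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "your-org/generative-reward-model"  # hypothetical checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

def prefer(prompt: str, response_a: str, response_b: str) -> str:
    """Return 'A' or 'B' depending on which response the model prefers."""
    judge_input = (
        "Given the prompt and two responses, answer with the letter of the "
        "better response.\n\n"
        f"Prompt: {prompt}\nResponse A: {response_a}\nResponse B: {response_b}\n"
        "Better response:"
    )
    inputs = tokenizer(judge_input, return_tensors="pt")
    with torch.no_grad():
        # Next-token logits at the final position act as the preference score.
        logits = model(**inputs).logits[0, -1]
    id_a = tokenizer.encode("A", add_special_tokens=False)[0]
    id_b = tokenizer.encode("B", add_special_tokens=False)[0]
    return "A" if logits[id_a] > logits[id_b] else "B"

print(prefer("Explain overfitting.", "Overfitting is ...", "It happens when ..."))
```

The same pairwise comparison can be used to rank candidate responses or to supply a preference signal during RLHF; adapting the model to a new task would then amount to light fine-tuning rather than training a reward model from scratch.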
Application-Driven Machine Learning: This submission is on Application-Driven Machine Learning.
Primary Area: Deep Learning->Large Language Models
Keywords: Large Language Model, Reward Modeling, RLHF
Submission Number: 15682