Information-Theoretic Reward Decomposition for Generalizable RLHF

Liyuan Mao; Haoran Xu; Amy Zhang; Weinan Zhang; Chenjia Bai

Published: 18 Sept 2025, Last Modified: 29 Oct 2025, Venue: NeurIPS 2025 poster, License: CC BY-NC 4.0

Keywords: Reinforcement Learning from Human Feedback, Reward Learning, Large Language Models

TL;DR: We decompose the reward value into a prompt-free reward and a prompt-related reward from an information-theoretic perspective, and use the former to guide reward learning.

Abstract: Obtaining a generalizable reward model is crucial in Reinforcement Learning from Human Feedback (RLHF), as it enables correct evaluation of unseen prompt-response pairs. However, existing reward models lack this ability: they are typically trained by increasing the reward gap between chosen and rejected responses, while overlooking the prompts that the responses are conditioned on. Consequently, when the trained reward model is evaluated on prompt-response pairs that lie outside the training data distribution, neglecting the effect of prompts may result in poor generalization. To address this issue, we decompose the reward value into two independent components: a prompt-free reward and a prompt-related reward. The prompt-free reward represents the evaluation that is determined only by the response, while the prompt-related reward reflects the reward that derives from both the prompt and the response. We extract these two components from an information-theoretic perspective, which requires no extra models. Subsequently, we propose a new reward learning algorithm that prioritizes data samples based on their prompt-free reward values. Through toy examples, we demonstrate that the extracted prompt-free and prompt-related rewards effectively characterize the two parts of the reward model. Further, standard evaluations show that our method improves both the alignment performance and the generalization capability of the reward model.
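The failure mode described in the abstract can be illustrated with a small toy sketch. It is not the paper's algorithm: it only uses the standard Bradley-Terry pairwise loss, -log sigmoid(r_chosen - r_rejected), which is the usual way of "increasing the reward gap", and a hypothetical reward function that ignores the prompt entirely. All names, prompts, and scores below are assumptions made up for illustration.

```python
import math


def bt_loss(r_chosen, r_rejected):
    """Bradley-Terry pairwise loss: -log(sigmoid(r_chosen - r_rejected)).

    Written as a numerically stable softplus of the negated reward gap.
    """
    gap = r_chosen - r_rejected
    return max(-gap, 0.0) + math.log1p(math.exp(-abs(gap)))


# Hypothetical scores that depend on the response only (assumption for this toy).
PROMPT_FREE_SCORES = {"4": 2.0, "5": -2.0}


def prompt_free_reward(prompt, response):
    """A toy reward that ignores the prompt and scores the response alone."""
    return PROMPT_FREE_SCORES[response]


def pair_loss(reward_fn, pair):
    """Loss of one (prompt, chosen, rejected) preference triple."""
    prompt, chosen, rejected = pair
    return bt_loss(reward_fn(prompt, chosen), reward_fn(prompt, rejected))


# A preference pair seen in training and an unseen pair with a new prompt.
train_pair = ("What is 2+2?", "4", "5")
unseen_pair = ("What is 2+3?", "5", "4")

train_loss = pair_loss(prompt_free_reward, train_pair)
unseen_loss = pair_loss(prompt_free_reward, unseen_pair)

print(f"train loss:  {train_loss:.4f}")
print(f"unseen loss: {unseen_loss:.4f}")
```

In this toy, the prompt-ignoring reward separates the training pair with a large gap, yet it ranks the unseen pair the wrong way, because the correct preference there depends on the prompt. This is only meant to make the motivation concrete; how the paper actually extracts the prompt-free component and prioritizes samples is not specified in the abstract and is not reproduced here.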

Primary Area: Deep learning (e.g., architectures, generative models, optimization for deep networks, foundation models, LLMs)

Submission Number: 8013
