The Energy Loss Phenomenon in RLHF: A New Perspective on Mitigating Reward Hacking

Published: 01 May 2025 · Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY 4.0
TL;DR: This work identifies the Energy Loss Phenomenon in RLHF and its connection to reward hacking. This insight inspires the design of energy loss-based regularization to mitigate reward hacking.
Abstract: This work identifies the *Energy Loss Phenomenon* in Reinforcement Learning from Human Feedback (RLHF) and its connection to reward hacking. Specifically, energy loss in the final layer of a Large Language Model (LLM) gradually increases during the RL process, with an *excessive* increase in energy loss characterizing reward hacking. Beyond empirical analysis, we further provide a theoretical foundation by proving that, under mild conditions, the increased energy loss reduces the upper bound of contextual relevance in LLMs, which is a critical aspect of reward hacking as the reduced contextual relevance typically indicates overfitting to reward model-favored patterns in RL. To address this issue, we propose an *Energy loss-aware PPO algorithm (EPPO)* which penalizes the increase in energy loss in the LLM's final layer during reward calculation to prevent excessive energy loss, thereby mitigating reward hacking. We theoretically show that EPPO can be conceptually interpreted as an entropy-regularized RL algorithm, which provides deeper insights into its effectiveness. Extensive experiments across various LLMs and tasks demonstrate the commonality of the energy loss phenomenon, as well as the effectiveness of EPPO in mitigating reward hacking and improving RLHF performance.
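To make the reward-shaping idea in the abstract concrete, below is a minimal sketch of an EPPO-style penalty. It assumes energy loss is measured as the drop in L1-norm "energy" between the final layer's input and output hidden states, and introduces a hypothetical penalty coefficient `eta` and a clamp so only increases relative to the reference (initial) model are penalized; these choices are illustrative assumptions, and the authors' actual formulation is in the linked repository.

```python
import torch

def energy_loss(h_in: torch.Tensor, h_out: torch.Tensor) -> torch.Tensor:
    """Energy loss of a layer: drop in L1 energy from its input hidden states
    to its output hidden states, averaged over tokens.
    NOTE: the L1-norm energy definition here is an illustrative assumption."""
    return (h_in.abs().sum(dim=-1) - h_out.abs().sum(dim=-1)).mean()

def eppo_shaped_reward(rm_reward: torch.Tensor,
                       h_in_policy: torch.Tensor, h_out_policy: torch.Tensor,
                       h_in_ref: torch.Tensor, h_out_ref: torch.Tensor,
                       eta: float = 0.1) -> torch.Tensor:
    """EPPO-style shaped reward: subtract a penalty proportional to the
    increase in the policy's final-layer energy loss over the reference model.
    `eta` is a hypothetical coefficient; clamping to only penalize increases
    is also an assumption for this sketch."""
    delta = energy_loss(h_in_policy, h_out_policy) - energy_loss(h_in_ref, h_out_ref)
    return rm_reward - eta * torch.clamp(delta, min=0.0)
```

In a PPO loop, the shaped reward would simply replace the raw reward-model score for each sampled response, leaving the rest of the RLHF pipeline unchanged.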
Lay Summary: Large language models (LLMs) like ChatGPT are trained to respond helpfully and safely to human instructions. One popular method for improving them is Reinforcement Learning from Human Feedback (RLHF), where the model learns from examples ranked or rated by humans. However, during this training process, models sometimes learn to "game the system", producing responses that look good to the reward model but lack genuine understanding or relevance. This issue is known as reward hacking. In this work, we uncover a new phenomenon linked to reward hacking: as training progresses, an internal signal in the model's final layer, which we call "energy loss", steadily increases. When this increase becomes excessive, it often signals that the model is overfitting to shallow patterns favored by the reward model rather than producing truly meaningful responses. To fix this, we introduce a new method called EPPO, which keeps the model's internal behavior healthy during training by penalizing excessive growth in energy loss. We show that this method not only reduces reward hacking but also makes the model more reliable overall. Our experiments across different tasks and models confirm that the energy loss phenomenon is widespread, and that our solution works.
Link To Code: https://github.com/miaoyuchun/Energy-Loss-Phenomenon
Primary Area: Deep Learning->Large Language Models
Keywords: Reward Hacking, Reward Overoptimization, Reinforcement Learning from Human Feedback, Large Language Models
Submission Number: 3248