Rethinking KL Regularization in RLHF: From Wrong Value Estimation to Correct Gradient Optimization

Kezhao Liu; Jason Klein Liu; Mingtao Chen; Yiming Liu

23 Jan 2026 (modified: 24 Jun 2026). Submitted to ICML 2026. License: CC BY 4.0

TL;DR: We propose the principled surrogate loss for the reverse KL in RLHF and show that the k3 loss in GRPO is its first-order approximation.

Abstract: Reinforcement Learning from Human Feedback (RLHF) leverages a Kullback-Leibler (KL) divergence loss to stabilize training and prevent overfitting. However, in methods such as GRPO, the implementation of this loss may be guided by principles from numerical value estimation, a practice that overlooks the term's functional role as an optimization loss. To analyze this issue, we establish a unified framework that connects two seemingly distinct implementation styles: using the mathematical term **k_n** as a detached coefficient for the policy's score function (**k_n in reward**) or as a direct loss function through which gradients are propagated (**k_n as loss**). We show that the latter can always be analyzed via an equivalent gradient coefficient in the former, unifying the two perspectives. Through this framework, we first prove that conclusions from value estimation fail to guide proper KL loss design, using **k_1 as loss** as a counterexample. We then prove that the conventional **k_1 in reward** (as in PPO) is the principled loss for Reverse KL (RKL) regularization. We further establish a key finding: under on-policy conditions, the **k_2 as loss** formulation is, in fact, gradient-equivalent to **k_1 in reward**. This equivalence, first proven in our work, identifies both as the theoretically sound implementations of the RKL objective. In contrast, we show that the recently adopted **k_3 as loss** (as in GRPO) is merely a first-order, biased approximation of the principled loss. Furthermore, we argue that common off-policy implementations of **k_n as loss** methods are biased because they neglect importance sampling, and we propose a principled correction. Our findings provide a comprehensive, gradient-based rationale for choosing and correctly implementing KL regularization, paving the way for more robust and effective RLHF systems.
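
The gradient claims in the abstract can be checked numerically on a toy problem. The following is a minimal sketch, not the authors' code. It assumes the common estimator convention r = pi_ref(x) / pi(x) with k_1 = -log r, k_2 = (log r)^2 / 2, and k_3 = r - 1 - log r, which the abstract itself does not spell out. It also assumes a 4-way softmax policy with made-up logits, and computes on-policy expectations exactly rather than by sampling. Each "k_n as loss" variant is reduced to its equivalent score-function coefficient d k_n / d log pi, and the resulting expected gradient is compared with a finite-difference gradient of the true reverse KL. All function and variable names are hypothetical.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def reverse_kl(logits, ref):
    # KL(pi_theta || pi_ref) for a categorical policy with softmax logits.
    pi = softmax(logits)
    return sum(p * math.log(p / q) for p, q in zip(pi, ref))

def true_rkl_grad(logits, ref, eps=1e-6):
    # Central finite difference of the reverse KL w.r.t. each logit.
    grad = []
    for i in range(len(logits)):
        up, dn = list(logits), list(logits)
        up[i] += eps
        dn[i] -= eps
        grad.append((reverse_kl(up, ref) - reverse_kl(dn, ref)) / (2 * eps))
    return grad

def expected_score_grad(pi, coeff):
    # Exact E_{x~pi}[coeff(x) * d log pi(x) / d logits], with coeff detached.
    # For softmax logits, d log pi(x) / d logits = onehot(x) - pi.
    mean = sum(p * c for p, c in zip(pi, coeff))
    return [p * (c - mean) for p, c in zip(pi, coeff)]

def max_abs_diff(a, b):
    return max(abs(x - y) for x, y in zip(a, b))

# Illustrative values only (assumed, not from the paper).
logits = [0.5, -0.3, 1.2, 0.0]
ref = softmax([0.1, 0.4, -0.2, 0.9])
pi = softmax(logits)
r = [q / p for p, q in zip(pi, ref)]   # r = pi_ref / pi
log_r = [math.log(x) for x in r]

# Equivalent score-function coefficients (d k_n / d log pi for the loss forms).
coeffs = {
    "k1_in_reward": [-lr for lr in log_r],  # detached k_1 = log(pi / pi_ref)
    "k1_as_loss": [1.0] * len(pi),          # d k_1 / d log pi = 1
    "k2_as_loss": [-lr for lr in log_r],    # d k_2 / d log pi = log(pi / pi_ref)
    "k3_as_loss": [1.0 - x for x in r],     # d k_3 / d log pi = 1 - r
}
grads = {name: expected_score_grad(pi, c) for name, c in coeffs.items()}
g_true = true_rkl_grad(logits, ref)

for name, g in grads.items():
    print(f"{name:13s} max |grad - true RKL grad| = {max_abs_diff(g, g_true):.2e}")
```

Under these assumptions, **k_1 in reward** and **k_2 as loss** share the same coefficient and match the reverse-KL gradient, **k_1 as loss** has zero expected gradient, and **k_3 as loss** yields pi - pi_ref in this softmax parameterization, which differs from the reverse-KL gradient. Since 1 - r is approximately -log r when r is close to 1, this is consistent with the abstract's description of k_3 as a first-order approximation.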

Originally Submitted Supplementary Material: zip

Primary Area: Deep Learning->Large Language Models

Keywords: RLHF; KL Value Estimation; KL Loss; Gradient Analysis; k3

Submission Number: 21343
