A Comedy of Estimators: On KL Regularization in RL Training of LLMs

ICLR 2026 Conference Submission 12894 Authors

18 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · License: CC BY 4.0
Keywords: Kullback-Leibler Estimation, Reasoning, LLMs, reinforcement learning
TL;DR: We provide a systematic study of the design choices pertaining to the use of KL divergence in RL training of LLMs.
Abstract: The reasoning performance of large language models (LLMs) can be substantially improved by training them with reinforcement learning (RL). The RL objective for LLM training involves a regularization term, the reverse Kullback-Leibler (KL) divergence between the trained policy and the reference policy. Since computing the KL divergence exactly is intractable, various estimators are used in practice to estimate it from on-policy samples. Despite their wide adoption, including in several open-source libraries, there is no systematic study analyzing the numerous ways of incorporating KL estimators into the objective and their effect on the downstream performance of RL-trained models. Tang & Munos (2025) show that prevailing practices for incorporating KL regularization do not provide correct gradients for the stated objectives, creating a discrepancy between the objective and its implementation. In this paper, we further analyze these practices and study the gradients of several estimators, revealing how design choices shape gradient bias. We substantiate these findings with empirical observations by RL fine-tuning Qwen2.5-7B and Llama-3.1-8B-Instruct with different configurations and evaluating their performance on both in- and out-of-distribution tasks. Through our analysis, we observe that: (1) estimator configurations with biased gradients can result in training instabilities; and (2) estimator configurations with unbiased gradients lead to better performance on in-domain as well as out-of-domain tasks. Overall, our findings provide useful takeaways for using KL-regularized objectives during RL post-training of LLMs.
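For context, below is a minimal sketch of the standard Monte Carlo estimators of the reverse KL computed from on-policy samples, commonly referred to as k1, k2, and k3 (following Schulman's naming); these are the kinds of estimators the abstract refers to. The tensor names (`logp_policy`, `logp_ref`) and the function itself are illustrative assumptions, not the paper's code.

```python
import torch


def kl_estimators(logp_policy: torch.Tensor, logp_ref: torch.Tensor) -> dict:
    """Per-token Monte Carlo estimators of the reverse KL, KL(pi || pi_ref),
    evaluated on samples drawn from the trained policy pi.

    Args:
        logp_policy: log pi(x) for on-policy samples x.
        logp_ref:    log pi_ref(x) for the same samples.
    """
    # log ratio of reference to policy: log r = log pi_ref(x) - log pi(x)
    log_ratio = logp_ref - logp_policy

    k1 = -log_ratio                        # unbiased value, can be negative, higher variance
    k2 = 0.5 * log_ratio.pow(2)            # biased value, lower variance
    k3 = log_ratio.exp() - 1 - log_ratio   # unbiased value, always non-negative

    return {"k1": k1, "k2": k2, "k3": k3}
```

Note that the sketch only computes estimator values; as the abstract (citing Tang & Munos, 2025) points out, how such an estimator is incorporated into the training objective, e.g., whether it is differentiated directly as a loss term or folded into the reward, determines whether the resulting gradient matches the intended KL-regularized objective.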
Primary Area: foundation or frontier models, including LLMs
Submission Number: 12894