KL-Regularized Reinforcement Learning is Designed to Mode Collapse

ICLR 2026 Conference Submission 21705 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: reinforcement learning, LLM, diversity, KL divergence
Abstract: It is commonly believed that optimizing the reverse KL divergence results in "mode seeking", while optimizing the forward KL results in "mass covering", with the latter preferred if the goal is to sample from multiple diverse modes. We show---mathematically and empirically---that this intuition does not necessarily transfer to reinforcement learning with reverse/forward KL regularization (e.g., as commonly used with language models). Instead, the choice of reverse/forward KL determines the _family_ of target distributions that maximize the objective, while mode coverage depends primarily on other factors, such as regularization strength. Further, we show that commonly used settings, such as low regularization strength and equal verifiable rewards, tend to specify uni-modal target distributions, meaning the optimization objective is _by construction_ non-diverse. We leverage these insights to construct a simple yet principled algorithm that makes minimal changes to reward magnitudes, and we theoretically prove that it optimizes for a target distribution placing high probability on _all_ high-quality sampling modes. We empirically show that this simple modification post-trains both Large Language Models and Chemical Language Models to achieve higher solution quality and diversity, without external diversity signals, and that it works with both forward and reverse KL regularization when either, used naively, fails.
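
For context on the objective the abstract refers to, below is a minimal sketch of the standard reverse-KL-regularized RL objective and its well-known closed-form optimum; the notation (reward $r$, regularization strength $\beta$, reference policy $\pi_{\mathrm{ref}}$) is an assumption for illustration and is not taken from the submission itself.

```latex
% Standard reverse-KL-regularized RL objective (notation assumed, not from the paper):
% maximize expected reward while penalizing divergence from the reference policy.
\max_{\pi}\;
  \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi(\cdot \mid x)}\big[ r(x, y) \big]
  \;-\; \beta\, \mathrm{KL}\big( \pi(\cdot \mid x) \,\big\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \big)

% Well-known closed-form maximizer: a reward-tilted reference distribution.
\pi^{*}(y \mid x)
  \;=\; \frac{1}{Z_{\beta}(x)}\, \pi_{\mathrm{ref}}(y \mid x)\,
        \exp\!\Big( \tfrac{r(x, y)}{\beta} \Big),
\qquad
Z_{\beta}(x) \;=\; \sum_{y} \pi_{\mathrm{ref}}(y \mid x)\,
        \exp\!\Big( \tfrac{r(x, y)}{\beta} \Big)

% As beta -> 0, mass concentrates on the set of reward-maximizing responses
% (reweighted by pi_ref), which is one way the target distribution can become
% effectively non-diverse at low regularization strength.
```

The abstract's point that mode coverage depends on regularization strength and reward magnitudes can be read off this form: both $\beta$ and $r(x, y)$ control how sharply the exponential tilt concentrates the target distribution.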
Primary Area: reinforcement learning
Submission Number: 21705