Keywords: reinforcement learning, LLM, diversity, KL divergence
TL;DR: KL-regularized RL often collapses to a single reward mode by design. We analyze why and propose a lightweight reward-rescaling method that preserves quality while improving mode coverage.
Abstract: Classical intuitions cast minimizing reverse KL as "mode seeking" and forward KL as "mass covering". In KL-regularized reinforcement learning, however, the regularizer determines _both_ the target distribution's shape _and_ the divergence being implicitly minimized, making its role more nuanced than simply inducing generic mode-seeking or mass-covering behaviour. Specifically, the target distribution is defined jointly by the reward function, the reference model, the type of regularizer, and the regularization strength. We show that under common settings—such as low regularization strength and equal verifiable rewards—both forward and reverse KL regularization tend to specify target distributions whose mass concentrates on a single high-reward region. Thus, the objective itself _by construction_ induces diversity collapse, regardless of the policy optimization algorithm used.
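The collapse mechanism described above can be illustrated with the standard closed-form solution for reverse-KL-regularized RL, where the optimal policy satisfies $\pi^*(y) \propto \pi_{\mathrm{ref}}(y)\,\exp(r(y)/\beta)$ with regularization strength $\beta$. The toy example below (names and numbers are illustrative assumptions, not from the paper) shows that with two near-optimal reward modes and a small $\beta$, even a tiny reward gap concentrates almost all target mass on a single mode:

```python
import numpy as np

# Toy sketch of the known closed form for reverse-KL-regularized RL:
#   pi*(y) ∝ pi_ref(y) * exp(r(y) / beta).
# Four candidate outputs, uniform reference policy, two near-optimal modes.
pi_ref = np.array([0.25, 0.25, 0.25, 0.25])  # reference model probabilities
r = np.array([1.00, 0.95, 0.10, 0.05])       # rewards: y0 and y1 are both "good"

def target(beta):
    """Normalized target distribution induced by reward r and strength beta."""
    w = pi_ref * np.exp(r / beta)
    return w / w.sum()

for beta in (1.0, 0.1, 0.01):
    print(f"beta={beta:<5} target={target(beta).round(3)}")
```

At `beta=1.0` the target spreads mass across both high-reward modes, but at `beta=0.01` it places over 99% of its mass on the single best output, matching the claim that low regularization strength induces collapse by construction.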
Building on this perspective, we introduce a simple and scalable modification that rescales rewards to induce target distributions assigning substantial probability across _all_ high-reward regions. This yields a principled objective that maintains high solution quality while achieving broad reward-mode coverage. Empirically, this approach improves post-training diversity and performance for Large Language Models and Chemical Language Models, and is effective with either forward or reverse KL regularization, whereas applying either regularizer naively fails to prevent collapse.
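The paper's exact rescaling rule is not reproduced here, so the sketch below uses a hypothetical stand-in (a simple threshold that equalizes rewards among high-reward outputs) purely to illustrate the intended effect: once all high-reward modes carry the same rescaled reward, the induced target $\pi^*(y) \propto \pi_{\mathrm{ref}}(y)\,\exp(r(y)/\beta)$ keeps substantial mass on every such mode even as $\beta \to 0$:

```python
import numpy as np

# Hypothetical rescaling for illustration only -- NOT the paper's method.
# Equalizing rewards across all high-reward outputs makes the induced target
# pi*(y) ∝ pi_ref(y) * exp(r(y)/beta) cover every high-reward mode.
pi_ref = np.array([0.25, 0.25, 0.25, 0.25])  # uniform reference policy
r_raw = np.array([1.00, 0.95, 0.10, 0.05])   # two near-optimal modes (y0, y1)

def target(r, beta):
    """Target distribution induced by rewards r at regularization strength beta."""
    w = pi_ref * np.exp(r / beta)
    return w / w.sum()

# Assumed rescaling: map every reward above a threshold to a common value.
r_rescaled = np.where(r_raw >= 0.9, 1.0, r_raw)

beta = 0.01
print("raw target:     ", target(r_raw, beta).round(3))
print("rescaled target:", target(r_rescaled, beta).round(3))
```

With the raw rewards the low-$\beta$ target collapses onto the single best output, while the rescaled rewards split the mass roughly evenly between both high-reward modes, illustrating how rescaling can restore mode coverage without sacrificing solution quality.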
Primary Area: reinforcement learning
Submission Number: 21705