Reverse-KL Reinforcement Learning Can Sample From Multiple Diverse Modes

Published: 23 Sept 2025, Last Modified: 07 Dec 2025, FoRLM 2025, CC BY 4.0
Keywords: RL, LLM, RLHF, KL, diversity, regularization
TL;DR: Theory on why the LLM RL objective naturally leads to mode collapse, and a simple, principled algorithm to fix it
Abstract: It is commonly believed that optimizing the reverse KL divergence results in "mode seeking", while optimizing the forward KL results in "mass covering", with the latter being preferred if the goal is to sample from multiple diverse modes. We show---mathematically and empirically---that this intuition does not necessarily transfer well to doing reinforcement learning with reverse/forward KL regularization (as used with verifiable rewards, human feedback, and reasoning tasks). Instead, the choice of reverse/forward KL determines the *family* of target distributions that maximize the objective, while mode coverage depends primarily on other factors, such as regularization strength. Further, we show that commonly used settings, such as low regularization strength and equal verifiable rewards, tend to specify uni-modal target distributions, meaning the optimization objective is *by construction* non-diverse. Finally, we leverage these insights to construct a simple, theoretically principled algorithm that explicitly optimizes for a multi-modal target distribution placing high probability on *all* high-quality samples. We show that this approach post-trains LLMs to have high solution diversity under both forward and reverse KL regularization, in settings where naively using either the forward or reverse KL fails.
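For context, a minimal sketch of the standard reverse-KL-regularized RL objective that the abstract refers to; the symbols $r$, $\beta$, and $\pi_{\mathrm{ref}}$ are illustrative notation, not taken from the paper:

\[
\max_{\pi}\; \mathbb{E}_{x \sim \pi}\!\left[ r(x) \right] \;-\; \beta\, \mathrm{KL}\!\left( \pi \,\|\, \pi_{\mathrm{ref}} \right),
\qquad
\pi^{*}(x) \;\propto\; \pi_{\mathrm{ref}}(x)\, \exp\!\left( r(x)/\beta \right).
\]

The closed form for $\pi^{*}$ is the well-known result from the RLHF literature; it illustrates the abstract's point that the regularization choice fixes the family of target distributions, while the strength $\beta$ governs how sharply $\pi^{*}$ concentrates on high-reward regions of $\pi_{\mathrm{ref}}$.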
Submission Number: 136