KL-Regularization Is Sufficient in Contextual Bandits and RLHF

ICLR 2026 Conference Submission 16549 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: bandit, RLHF, dueling bandit, KL regularization
TL;DR: We are the first to show that KL regularization alone guarantees sublinear regret in KL-regularized contextual bandits and RLHF, without requiring any additional exploration methods.
Abstract: Recently, reinforcement learning from human feedback (RLHF) has demonstrated remarkable efficiency in fine-tuning large language models (LLMs), fueling a surge of interest in KL regularization. Yet the theoretical foundations of KL regularization remain underexplored. Many prior works employ either explicit online exploration strategies (such as UCB, Thompson sampling, and forced sampling) or optimism-embedded optimization techniques (e.g., Xie et al. 2024) *in addition to KL regularization* to achieve sublinear regret in online RLHF. In this paper, we show, for the first time to the best of our knowledge, that such additional exploration strategies are unnecessary once KL regularization is included: KL regularization alone suffices to guarantee sublinear regret. We propose **KL-EXP** (and its RLHF variant, **OEPO**), an algorithm that achieves logarithmic *KL-regularized* regret, the standard objective in KL-regularized contextual bandits and RLHF, while also attaining $\tilde{\mathcal{O}}(\sqrt{T})$ *unregularized* regret, both under general function approximation. As a special case, in linear contextual bandits, we establish an $\tilde{\mathcal{O}}(\sqrt{dT \log N})$ bound on the unregularized regret, where $d$ is the feature dimension and $N$ is the number of arms. To the best of our knowledge, this is the first $\tilde{\mathcal{O}}(\sqrt{dT \log N})$-type regret bound achieved without resorting to supLin-type algorithms, making the approach substantially more practical. Our experiments on linear and neural bandits, as well as on LLM fine-tuning with RLHF, demonstrate that our algorithms significantly outperform the baselines while remaining practical.
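For context, a minimal sketch of the standard KL-regularized contextual-bandit objective that the abstract's regret notions refer to; the symbols ($r$ for the reward, $\beta$ for the regularization strength, $\pi_{\mathrm{ref}}$ for the reference policy) are assumed here for illustration and may differ from the paper's own notation:

$$
J_\beta(\pi \mid x) \;=\; \mathbb{E}_{a \sim \pi(\cdot \mid x)}\big[\, r(x,a) \,\big] \;-\; \beta\,\mathrm{KL}\big( \pi(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \big),
\qquad
\pi^{*}_{\beta}(a \mid x) \;\propto\; \pi_{\mathrm{ref}}(a \mid x)\, \exp\!\big( r(x,a)/\beta \big).
$$

Under this formulation, the KL-regularized regret over $T$ rounds is $\mathrm{Reg}_\beta(T) = \sum_{t=1}^{T} \big( J_\beta(\pi^{*}_{\beta} \mid x_t) - J_\beta(\pi_t \mid x_t) \big)$, while the unregularized regret drops the KL terms and compares expected rewards only.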
Supplementary Material: zip
Primary Area: learning theory
Submission Number: 16549