Keywords: bandit, RLHF, dueling bandit, KL regularization
TL;DR: We are the first to show that KL regularization alone guarantees sublinear regret in KL-regularized contextual bandits and RLHF, without requiring any additional exploration methods.
Abstract: Recently, reinforcement learning from human feedback (RLHF) has demonstrated remarkable efficiency in fine-tuning large language models (LLMs), fueling a surge of interest in KL regularization. Yet the theoretical foundations of KL regularization remain underexplored. Many prior works employ either explicit online exploration strategies (such as UCB, Thompson sampling, and forced sampling) or optimism-embedded optimization techniques (e.g., Xie et al. 2024), *in addition to KL regularization*, to achieve sublinear regret in online RLHF. In this paper, we show, for the first time to the best of our knowledge, that such additional exploration strategies are unnecessary if KL regularization is already included. That is, KL regularization alone suffices to guarantee sublinear regret. To handle general function classes, we assume access to an online regression oracle and propose **KL-EXP** (and its RLHF variant, **OEPO**), which achieves logarithmic KL-regularized regret, the standard objective in KL-regularized contextual bandits and RLHF, while also attaining an *unregularized* regret of $\tilde{\mathcal{O}}\big(\sqrt{\log N \cdot T \cdot \text{Reg}_{\text{Sq}}(T)}\big)$, where $N$ is the number of actions, $T$ is the total number of rounds, and $\text{Reg}_{\text{Sq}}(T)$ is the regret bound of the online regression oracle. To the best of our knowledge, this is the first result to achieve regret with only logarithmic dependence on $N$ in oracle-based contextual bandits. As a special case, in linear contextual bandits we establish a $\tilde{\mathcal{O}}(\sqrt{dT \log N})$ bound on the unregularized regret, where $d$ is the feature dimension. Again to the best of our knowledge, this is the first $\tilde{\mathcal{O}}(\sqrt{dT \log N})$-type regret bound achieved without resorting to SupLin-type algorithms, making the algorithm substantially more practical.
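For context, the KL-regularized objective referenced in the abstract is typically written as below. This is a standard formulation from the KL-regularized bandit/RLHF literature, not taken verbatim from the paper; the notation ($r$ for the reward, $\pi_{\mathrm{ref}}$ for the reference policy, $\beta > 0$ for the regularization strength) is assumed here and may differ from the paper's.

$$
J_\beta(\pi) \;=\; \mathbb{E}_{x}\!\left[\, \mathbb{E}_{a \sim \pi(\cdot \mid x)}\big[r(x,a)\big] \;-\; \beta\, \mathrm{KL}\big(\pi(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x)\big) \right]
$$

The KL-regularized regret over $T$ rounds is then $\sum_{t=1}^{T} \big( J_\beta(\pi^*) - J_\beta(\pi_t) \big)$, where $\pi^*(a \mid x) \propto \pi_{\mathrm{ref}}(a \mid x)\exp\!\big(r(x,a)/\beta\big)$ is the optimal regularized (softmax) policy; the unregularized regret drops the KL term from $J_\beta$.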
Supplementary Material: zip
Primary Area: learning theory
Submission Number: 16549