Towards a Sharp Analysis of Offline Policy Learning for $f$-Divergence-Regularized Contextual Bandits
Keywords: contextual bandits, policy optimization, $f$-divergence, regularization
TL;DR: We establish the first near-optimal sample complexity for offline policy optimization with respect to two classes of $f$-divergence-regularized objectives under the weakest coverage requirements.
Abstract: Although many popular reinforcement learning algorithms are underpinned by $f$-divergence regularization, their sample complexity with respect to the regularized objective still lacks a tight characterization. In this paper, we analyze $f$-divergence-regularized offline policy learning. For reverse Kullback–Leibler (KL) divergence, arguably the most commonly used one, we give the first $\tilde{O}(\epsilon^{-1})$ sample complexity under single-policy concentrability for contextual bandits, improving on the existing $\tilde{O}(\epsilon^{-1})$ bound under all-policy concentrability and the $\tilde{O}(\epsilon^{-2})$ bound under single-policy concentrability. Our analysis for general function approximation leverages the principle of pessimism in the face of uncertainty to refine a mean-value-type argument to its extreme. This in turn leads to a novel moment-based technique, effectively bypassing the need for uniform control over the discrepancy between any two functions in the function class. The near-optimality of the upper bound is justified by matching lower bounds, which also demonstrate that single-policy concentrability is necessary to maximally exploit the strong convexity of reverse KL.
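For concreteness, a standard form of the reverse-KL-regularized objective in offline contextual bandits is sketched below; the notation ($\rho$, $\beta$, $\pi_{\mathrm{ref}}$, $\mu$, $C^\star$) is illustrative and not taken from the paper.
$$
J_\beta(\pi) \;=\; \mathbb{E}_{x\sim\rho}\!\left[\,\mathbb{E}_{a\sim\pi(\cdot\mid x)}\big[r(x,a)\big] \;-\; \beta\,\mathrm{KL}\big(\pi(\cdot\mid x)\,\big\|\,\pi_{\mathrm{ref}}(\cdot\mid x)\big)\right],
$$
where $\beta>0$ is the regularization strength and $\pi_{\mathrm{ref}}$ is a reference policy. In this sketch, single-policy concentrability only requires a bounded density ratio between the optimal regularized policy $\pi^\star_\beta$ and the behavior policy $\mu$, e.g. $\sup_{x,a}\,\pi^\star_\beta(a\mid x)/\mu(a\mid x)\le C^\star<\infty$, whereas all-policy concentrability demands such a bound uniformly over the entire policy class.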
In addition, for $f$-divergences with strongly convex $f$, a class to which reverse KL does not belong, we show that the sharp sample complexity $\tilde{\Theta}(\epsilon^{-1})$ is achievable even without single-policy concentrability. In this case, the algorithm design can dispense with pessimistic estimators. We further extend our analysis to dueling bandits, and we believe these results take a significant step toward a comprehensive understanding of $f$-divergence-regularized policy learning.
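To illustrate the second class, one common parameterization of the $f$-divergence-regularized objective is written below; the conventions are ours for illustration and need not match the paper's.
$$
D_f\big(\pi(\cdot\mid x)\,\big\|\,\pi_{\mathrm{ref}}(\cdot\mid x)\big) \;=\; \sum_{a}\pi_{\mathrm{ref}}(a\mid x)\, f\!\left(\frac{\pi(a\mid x)}{\pi_{\mathrm{ref}}(a\mid x)}\right),
\qquad
J_{f,\beta}(\pi) \;=\; \mathbb{E}_{x\sim\rho}\!\left[\,\mathbb{E}_{a\sim\pi(\cdot\mid x)}\big[r(x,a)\big] \;-\; \beta\, D_f\big(\pi(\cdot\mid x)\,\big\|\,\pi_{\mathrm{ref}}(\cdot\mid x)\big)\right].
$$
For example, the $\chi^2$ generator $f(t)=(t-1)^2$ satisfies $f''(t)=2$ and is therefore strongly convex on $(0,\infty)$, while the generator yielding reverse KL is only strictly convex, consistent with the separation described above.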
Submission Number: 44