\section{Introduction}
\label{sec:introduction}


Offline contextual bandits \citep{dudik2011doubly} have gained significant interest as an effective framework for optimizing decision-making using offline data. In this framework, an agent observes a context, takes an action based on a policy, i.e., a probability distribution over a set of actions, and receives a cost that depends on both the action and the context. These sequential interactions, recorded as logged data, serve two purposes in offline scenarios. The first is off-policy evaluation (OPE), which aims to estimate the expected cost (risk) of a fixed target policy using the logged data. The second is off-policy learning (OPL), whose goal is to find a policy that minimizes the risk. In general, OPL relies on OPE's risk estimator.


In OPE, a significant portion of research has focused on the inverse propensity scoring (IPS) estimator of the risk \citep{horvitz1952generalization, dudik2011doubly}. IPS employs importance weights (IWs), which are the ratios between the target policy and the logging policy used to collect data, to estimate the risk of the target policy. Although IPS is unbiased under mild assumptions, it can suffer from high variance, especially when the target and logging policies differ significantly \citep{swaminathan2017off}. To address this issue, various methods have been developed to regularize IPS, primarily by transforming the IWs \citep{bottou2013counterfactual, swaminathan2015batch, su2020doubly, metelli2021subgaussian, aouali23a, gabbianelli2023importance}. These regularizations introduce some bias in exchange for a reduction in the estimator's variance. Most of them have been proposed and investigated in the context of OPE, where the primary goal is to enhance the estimator's accuracy, typically measured by mean squared error (MSE). In contrast, OPL aims to find a policy with minimal risk, and an estimator with lower MSE does not necessarily yield a better learned policy. Therefore, it is crucial to determine whether these IW regularizations lead to better performance in OPL.
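
For concreteness, the IPS estimator can be sketched as follows; the formal setting is given in \Cref{sec:setting}, and the notation here is assumed only for illustration. Let $(x_i, a_i, c_i)_{i=1}^{n}$ denote the logged contexts, actions, and costs, let $\pi_0$ denote the logging policy, and let $\pi$ denote the target policy. The estimator then reads
\[
\hat{R}_n^{\mathrm{IPS}}(\pi) = \frac{1}{n} \sum_{i=1}^{n} \frac{\pi(a_i \mid x_i)}{\pi_0(a_i \mid x_i)} \, c_i ,
\]
where the ratio $\pi(a_i \mid x_i) / \pi_0(a_i \mid x_i)$ is the IW of the $i$-th interaction. When $\pi_0(a_i \mid x_i)$ is small while $\pi(a_i \mid x_i)$ is not, this ratio becomes large, which is the source of the high variance mentioned above.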

A common approach in OPL is to learn the policy through pessimistic learning principles \citep{jin2021pessimism}, where the estimated risk is optimized along with a penalty term often derived from generalization bounds. Previous studies on OPL with regularized IPS estimators have adopted this approach, but each focused on a specific IW regularization. For example, \citet{swaminathan2015batch} studied the IPS estimator with clipped IWs and proposed learning a policy by minimizing the estimated risk penalized with an empirical variance term. \citet{london2019bayesian} suggested an alternative regularization for the same estimator, incorporating an $L_2$ distance to the logging policy. \citet{sakhi2022pac} derived tractable generalization bounds for a simplified doubly robust version of the IPS estimator with clipped IWs and used these bounds in their pessimistic learning principle. Similarly, \citet{aouali23a} derived a tractable bound for an estimator that exponentially smooths the IWs instead of clipping them and proposed two learning principles: one where the bound is optimized and another heuristic inspired by it. Finally, \citet{gabbianelli2023importance} introduced implicit exploration regularization, where a constant is added to the denominator of the IWs, and used a learning principle that directly minimizes the corresponding estimator, since their generalization upper bound did not depend on the target policy.
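
As an illustration, the three IW transformations mentioned above can be written side by side. The notation is assumed here only for this sketch, and the exact parametrizations in the cited works may differ: $\pi$ is the target policy, $\pi_0$ is the logging policy, $w = \pi(a \mid x) / \pi_0(a \mid x)$ is the unregularized IW, and $M > 0$, $\alpha \in [0, 1]$, and $\gamma \geq 0$ are hyperparameters. Clipping, exponential smoothing, and implicit exploration then correspond, respectively, to
\[
\hat{w}^{\mathrm{clip}} = \min(w, M), \qquad
\hat{w}^{\mathrm{ES}} = \frac{\pi(a \mid x)}{\pi_0(a \mid x)^{\alpha}}, \qquad
\hat{w}^{\mathrm{IX}} = \frac{\pi(a \mid x)}{\pi_0(a \mid x) + \gamma}.
\]
Each transformation recovers the unregularized IW in a limiting case ($M \to \infty$, $\alpha = 1$, and $\gamma = 0$, respectively), and otherwise yields a weight no larger than $w$.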


A limitation of these studies is that their guarantees and learning principles are specific to the particular IW regularization they consider and are not transferable to other IW regularizations. Consequently, in their OPL experiments, IW regularizations are compared using different learning principles, making it difficult to determine if better performance is due to the enhanced properties of the proposed IW regularizer or merely an artifact of the proposed learning principle. As a result, it remains unclear whether a particular IW regularization yields better performance in OPL. This highlights a gap in the literature: there is no unified study providing bounds on the risk of policies learned using pessimistic learning principles tailored to various regularized IW estimators of the risk. Our work aims to bridge this gap. Specifically, we provide a generic, practical generalization bound and an associated learning principle that apply universally to a large family of IW regularizations, enabling a fair comparison in practice on OPL tasks.



This paper is organized as follows. \Cref{sec:setting} provides the necessary background on IPS estimators and IW regularizations. \Cref{sec:certificates} reviews related work, focusing on the guarantees and learning principles found in the OPL literature. \Cref{sec:main_result} presents our PAC-Bayesian generalization bounds for regularized IPS and introduces our learning principles derived from these bounds. Finally, \Cref{sec:experiments} compares different IW regularizations on real-world datasets.