\section{PESSIMISTIC LEARNING PRINCIPLES}\label{sec:certificates}
We now discuss the theoretical guarantees and pessimistic learning principles previously derived in the literature. An extended related work section can be found in \cref{sec:related_work}. 

Let $\hat{R}$ be an estimator of the risk $R$. In OPL, the goal is to minimize the unknown risk $R$ using the estimator $\hat{R}$. Pessimistic learning principles typically penalize $\hat{R}$, aiming to find $\hat{\pi}_n = \argmin_{\pi \in \Pi} \hat{R}(\pi, S) + \operatorname{pen}(\pi, S)$, with the expectation that $R(\hat{\pi}_n) \approx \min_{\pi \in \Pi} R(\pi)$. The penalization term $\operatorname{pen}(\cdot, S)$ is derived using one of the following methods.

\textbf{The Use of Evaluation Bounds.} \citet{metelli2021subgaussian} derived \emph{evaluation} bounds for the \texttt{Har} regularization in \eqref{eq:regs} and used them to formulate a pessimistic OPL learning principle. Specifically, they showed that the following inequality holds for a \emph{fixed target policy $\pi \in \Pi$} and $\delta \in (0, 1)$
\begin{align}\label{eq:eval_bound}
    \mathbb{P}\big(\big|R(\pi) - \hat{R}(\pi, S) \big|\leq f(\delta, \pi, \pi_0, n)\big) \geq 1-\delta\,,
\end{align}
for some function $f$. Essentially, \eqref{eq:eval_bound} indicates that for a fixed policy $\pi \in \Pi$, the event $|R(\pi) - \hat{R}(\pi, S)| \leq f(\delta, \pi, \pi_0, n)$ holds with high probability. However, this event depends on the target policy $\pi$. %and may change when $\pi$ changes. 
Thus, \eqref{eq:eval_bound} is useful for evaluating a \emph{single target policy} given access to \emph{multiple logged data sets $S$}. This poses a problem for OPL, where we optimize over a potentially \emph{infinite space of policies} using a \emph{single logged data set $S$}. This is the fundamental theoretical limitation of using evaluation bounds such as \eqref{eq:eval_bound} in OPL. While it is possible to transform \eqref{eq:eval_bound} into a generalization bound that holds simultaneously for all policies $\pi \in \Pi$ by applying a union bound, this approach may result in intractable complexity terms and, consequently, intractable pessimistic learning principles.
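To make the union-bound route concrete, here is a minimal sketch under the simplifying assumption (made here for illustration only) that $\Pi$ is finite. Let $E_\pi$ denote the failure event $\big\{|R(\pi) - \hat{R}(\pi, S)| > f(\delta/|\Pi|, \pi, \pi_0, n)\big\}$; this shorthand is introduced only for this sketch. Applying \eqref{eq:eval_bound} to each $\pi \in \Pi$ at confidence level $\delta/|\Pi|$ gives $\mathbb{P}(E_\pi) \leq \delta/|\Pi|$, and hence
\begin{align*}
    \mathbb{P}\Big(\bigcup_{\pi \in \Pi} E_\pi\Big) \leq \sum_{\pi \in \Pi} \mathbb{P}(E_\pi) \leq |\Pi| \cdot \frac{\delta}{|\Pi|} = \delta\,.
\end{align*}
The bound therefore holds simultaneously for all $\pi \in \Pi$ with probability at least $1 - \delta$, at the price of replacing $\delta$ with $\delta/|\Pi|$ inside $f$. When $\Pi$ is infinite, this argument no longer applies directly and typically has to go through a discretization of $\Pi$, which is where complexity terms of the kind mentioned above enter.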

\textbf{The Use of One-Sided Generalization Bounds.} Alternatively, generalization bounds \citep{swaminathan2015batch, london2019bayesian, sakhi2022pac} address the limitations of evaluation bounds. These bounds generally take the following form: for $\delta \in (0, 1)$,
\begin{align}\label{eq:opl_one_sided}
    \mathbb{P}\big(\forall \pi \in \Pi, R(\pi) \leq \hat{R}(\pi, S) + f(\delta, \Pi, \pi, &\pi_0, n)\big) \\  &\geq 1-\delta\,,\nonumber
\end{align}
where the function $f$ now depends on the space of policies $\Pi$. The key difference between \eqref{eq:eval_bound} and \eqref{eq:opl_one_sided} is that here the event $R(\pi) \leq \hat{R}(\pi, S) + f(\delta, \Pi, \pi, \pi_0, n)$ holds with high probability for all target policies $\pi$ simultaneously. Since this is a high-probability event, we assume it holds for our logged data $S$. This is then used to define the learned policy $\hat{\pi}_n \in \Pi$ as
\begin{align}\label{eq:objective}
 \hat{\pi}_n =  \argmin_{\pi \in \Pi} \hat{R}(\pi, S) + f(\delta, \Pi, \pi, \pi_0, n)\,.
\end{align}
The issue with \eqref{eq:opl_one_sided} is that it is a \emph{one-sided} inequality that does not attest to the quality of the estimator $\hat{R}$. To illustrate, consider the poor risk estimator $\hat{R}^{\textsc{poor}}(\pi) = 0$ for any $\pi \in \Pi$. Then $R(\pi) \leq \hat{R}^{\textsc{poor}}(\pi)$ holds with probability 1 because, by definition, $R(\pi) \in [-1, 0]$. However, $\hat{R}^{\textsc{poor}}$ is not informative about $R$, making its minimization irrelevant. Thus, we need to control the quality of the upper bound on $R$, which is achieved by \emph{two-sided} inequalities of the form
\begin{talign}\label{eq:opl_two_sided}
     \mathbb{P}\big(\forall \pi \in \Pi, |R(\pi) - \hat{R}(\pi, S)| \leq f(\delta, \Pi, \pi, &\pi_0, n)\big) \\  &\geq 1-\delta\,.\nonumber
\end{talign}
Here, the pessimistic learning principle in \eqref{eq:objective} uses the function $f$ from the two-sided inequality in \eqref{eq:opl_two_sided}. This allows us to derive high-probability bounds on the suboptimality (SO) gap of $\hat{\pi}_n$, defined as $R(\hat{\pi}_n) - R(\pi_*)$, where $\pi_* = \argmin_{\pi \in \Pi} R(\pi)$ is the optimal policy. Specifically, with probability at least $1 - \delta$, we can show that $R(\hat{\pi}_n) - R(\pi_*) \leq 2f(\delta, \Pi, \pi_*, \pi_0, n)$. This demonstrates why pessimism is appealing in OPL: the bound involves $f$ evaluated only at the optimal policy $\pi_*$. Consequently, the risk estimator $\hat{R}$ needs to be precise only for the optimal policy, rather than for all policies within the class $\Pi$.
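The bound on the suboptimality gap follows from a short chain of inequalities, sketched here for completeness. Assume that the event in \eqref{eq:opl_two_sided} holds, and write $f(\pi)$ as shorthand (introduced only for this sketch) for $f(\delta, \Pi, \pi, \pi_0, n)$. Then
\begin{align*}
    R(\hat{\pi}_n) &\leq \hat{R}(\hat{\pi}_n, S) + f(\hat{\pi}_n) \\
    &\leq \hat{R}(\pi_*, S) + f(\pi_*) \\
    &\leq R(\pi_*) + 2 f(\pi_*)\,.
\end{align*}
The first step applies \eqref{eq:opl_two_sided} at $\hat{\pi}_n$, the second uses the fact that $\hat{\pi}_n$ minimizes the objective in \eqref{eq:objective}, and the third applies \eqref{eq:opl_two_sided} at $\pi_*$ in the other direction, namely $\hat{R}(\pi_*, S) \leq R(\pi_*) + f(\pi_*)$. Only this last step requires the lower side of the inequality, which a one-sided bound of the form \eqref{eq:opl_one_sided} does not provide.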

\textbf{The Use of Heuristics.} Many studies have proposed specific heuristics where a simplified function \( g \) is used instead of the theoretical function $f$ in \eqref{eq:objective}. For example, \citet{swaminathan2015batch} minimized the estimated risk while penalizing the empirical variance of the estimator. This approach was inspired by a generalization bound whose function \( f \) includes a variance term; the heuristic keeps this term but discards more complicated ones, such as the covering number of the policy space \( \Pi \). Similarly, \citet{london2019bayesian} parameterized policies by a mean parameter and proposed penalizing the estimated risk by the \( L_2 \) distance between the means of the logging and target policies, discarding all other terms from their generalization bound. While these heuristics lead to tractable and computationally attractive objectives, they often lack theoretical justification and guarantees. We note that pessimistic principles have also been used in contexts other than regularized IPS estimators. For example, \citet{wang2023oracle} proposed a heuristic approach where the standard (non-regularized) IPS estimator $\hat{R}_{\textsc{ips}}(\pi, S)$ in \eqref{eq:ips_policy_value} is penalized with a pseudo-loss \(\textsc{PL}(\pi, S) = \frac{1}{n} \sum_{i \in [n]}\sum_{a \in \cA} \frac{\pi(a|x_i)}{\pi_0(a|x_i)}\). Precisely, they defined \(\hat{\pi}_n = \argmin_{\pi \in \Pi} \hat{R}_{\textsc{ips}}(\pi, S) + \beta \textsc{PL}(\pi, S)\), where \(\beta\) is a hyperparameter. They upper bounded the suboptimality gap of their $\hat{\pi}_n $ for a specific theoretical choice of \(\beta\).
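As a rough illustration of the variance-penalized heuristic discussed at the start of this paragraph, the simplified penalty $g$ can be sketched as follows. The symbols $\hat{V}$ and $\lambda$ are notation assumed for this sketch and are not taken from the cited work:
\begin{align*}
    \hat{\pi}_n = \argmin_{\pi \in \Pi} \hat{R}(\pi, S) + \lambda \sqrt{\frac{\hat{V}(\pi, S)}{n}}\,,
\end{align*}
where $\hat{V}(\pi, S)$ denotes the empirical variance of the per-sample terms whose average is $\hat{R}(\pi, S)$, and $\lambda \geq 0$ is a tunable hyperparameter that stands in for the discarded complexity terms.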


\textbf{The Use of Implicit Pessimism.} Recently, \citet{gabbianelli2023importance} proposed the use of the \texttt{IX}-estimator in \eqref{eq:regs} in OPL and demonstrated that, with careful analysis, they could obtain tight bounds. They observed that the \texttt{IX}-estimator exhibits asymmetry and thus did not use a single two-sided inequality to derive their bound. Instead, they analyzed each side individually using distinct methods and combined the results to obtain the desired two-sided inequality. In particular, this allowed them to derive an upper bound function \( f \) that depends only on the policy space \(\Pi\), confidence level \(\delta\), and the number of samples \( n \), such that \( f(\delta, \Pi, \pi, \pi_0, n) = f(\delta,\Pi,n) \). This led them to define
\begin{align}
 \hat{\pi}_n &=  \argmin_{\pi \in \Pi} \hat{R}(\pi, S) + f(\delta, \Pi, \pi, \pi_0, n) \\
 &=\argmin_{\pi \in \Pi} \hat{R}(\pi, S) + f(\delta, \Pi, n) = \argmin_{\pi \in \Pi} \hat{R}(\pi, S)\,,\nonumber
\end{align}
where the principle of pessimism becomes equivalent to directly minimizing the estimator since \( f \) does not depend on \(\pi\). This approach is appealing as it avoids computing potentially heavy statistics of the upper bound while still enjoying the benefits of pessimism. However, it requires a careful analysis of the specific IW regularization, whereas we provide a generic bound that holds for any IW regularization. Following \citet{aouali23a}, we directly derive two-sided bounds for regularized IPS, which might be loose depending on the logging policy (\cref{app:bound_tightness}) but still lead to good empirical performance (\cref{sec:experiments}). Investigating a similar asymmetric analysis for general regularized IPS is an interesting avenue for future work.


\textbf{Our Approach.} We derive a two-sided generalization bound that holds simultaneously for any policy \(\pi \in \Pi\), as outlined in \eqref{eq:opl_two_sided}. We examine two pessimistic learning principles: directly optimizing the bound, or optimizing a simplified penalty inspired by it (heuristic). Both principles apply to any IW regularization, including the standard, non-regularized IPS. Our theory builds on the proof of the PAC-Bayesian bounds in \citet{aouali23a}, extending its scope beyond the \texttt{ES} regularization in \eqref{eq:regs} to include other IW regularizations. A limitation of previous work is that empirical comparisons involved different pessimistic learning principles, each employing a different IW regularization for IPS. For example, \citet{aouali23a} compared optimizing \texttt{ES}-IPS penalized by their generalization bound with optimizing \texttt{Clip}-IPS penalized by existing bounds (e.g., \citealp{sakhi2022pac, london2019bayesian}). Although they demonstrated significant improvements in OPL performance with \texttt{ES}, they did not determine whether these improvements were due to the new IW regularization technique (\texttt{ES} vs. \texttt{Clip}) or the new generalization bound (their bounds vs. those in \citet{sakhi2022pac, london2019bayesian}). This ambiguity motivates our development of a generic generalization bound that applies to any IW regularization and also serves as the basis for a generic heuristic inspired by it.