\section{Experiments}\label{sec:experiments}
We present our core experiments in this section. Details and additional experiments are provided in \cref{sec:experiments_details}, where we also discuss the tightness of our bound in \cref{app:bound_tightness}. Our code is publicly available on \href{https://github.com/imadaouali/unified-pessimism-opl}{GitHub}.
\subsection{Setting}
We adopt a similar setting to \citet{sakhi2022pac}. We begin with a supervised training set \(\mathcal{S}^{\textsc{tr}}\) and convert it into logged bandit data \(S\) using the standard supervised-to-bandit conversion method \citep{agarwal2014taming}. In this conversion, the label set \(\cA\) serves as the action space, while the input space serves as the context space \(\cX\). We then use \(S\) to train our policies. After training, we evaluate the reward of the learned policies on the supervised test set \(\mathcal{S}^{\textsc{ts}}\). The resulting reward measures the ability of the learned policy to predict the true labels of the inputs in the test set and serves as our performance metric. We use two image classification datasets for this purpose: \texttt{MNIST} \citep{lecun1998gradient} and \texttt{FashionMNIST} \citep{xiao2017fashion}. Although we also explored the \texttt{EMNIST} dataset, it led to similar conclusions, so we did not include it to reduce clutter.
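
The supervised-to-bandit conversion described above can be sketched as follows. This is only a minimal illustration, not the released code: the function name \texttt{supervised\_to\_bandit}, the tuple layout \((x, a, r, p)\), the binary reward \(r = \mathbb{1}\{a = y\}\), and the toy data are all assumptions made for this example.

```python
import random


def supervised_to_bandit(features, labels, logging_policy, seed=0):
    """Turn labelled pairs (x, y) into logged bandit tuples (x, a, r, p).

    `logging_policy(x)` must return a list of action probabilities.
    Assumed convention: reward r = 1 if the sampled action matches the
    true label and 0 otherwise (some papers log the cost -r instead).
    """
    rng = random.Random(seed)
    logged = []
    for x, y in zip(features, labels):
        probs = logging_policy(x)
        # Sample one action from the logging policy; only its reward is observed.
        a = rng.choices(range(len(probs)), weights=probs, k=1)[0]
        r = 1.0 if a == y else 0.0
        logged.append((x, a, r, probs[a]))
    return logged


def uniform_policy(x, num_actions=10):
    """Hypothetical logging policy: uniform over the label set."""
    return [1.0 / num_actions] * num_actions


# Toy data standing in for the supervised training set (illustrative only).
toy_features = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
toy_labels = [3, 7, 1]
S = supervised_to_bandit(toy_features, toy_labels, uniform_policy)
```

Each logged tuple keeps the propensity of the sampled action, which is what importance-weighted estimators later divide by.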

We define the logging policy as \(\pi_0 = \pi_{\eta_0 \cdot \mu_0}^{\textsc{sof}}\), following \eqref{eq:softmax_pac_bayes},
\begin{align}
    \pi^{\textsc{sof}}_{\eta_0 \cdot \mu_0}(a | x) &= \frac{\exp(\eta_0\phi(x)^\top \mu_{0,a})}{\sum_{a^\prime \in \cA}\exp(\eta_0 \phi(x)^\top  \mu_{0, a^\prime})}\,,
\end{align}
where \(\mu_0 = (\mu_{0,a})_{a \in \cA} \in \mathbb{R}^{dK}\) are learned using 5\% of the training set \(\mathcal{S}^{\textsc{tr}}\). The parameter \(\eta_0 \in \mathbb{R}\) is an inverse-temperature parameter that controls the quality of the logging policy \(\pi_0\). Higher values of \(\eta_0\) lead to a better-performing logging policy, while lower values lead to a poorer-performing logging policy. In particular, \(\eta_0 = 0\) corresponds to a uniform logging policy. We set the prior as \(\mathbb{P} = \mathcal{N}(\eta_0 \mu_0, I_{dK})\) in all PAC-Bayesian learning principles considered in these experiments, including ours. We train policies on the remaining 95\% of \(\mathcal{S}^{\textsc{tr}}\) using Adam \citep{kingma2014adam} with a learning rate of 0.1 for 20 epochs. The training objective for learning the policy varies based on the chosen method: we use our theoretical bound in \eqref{eq:objective_pac_bayes}, our proposed heuristic in \eqref{eq:learning_principle}, or other pessimism learning principles found in the literature.
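
To illustrate the role of \(\eta_0\), the snippet below evaluates the softmax logging policy for a single context. It is a sketch under stated assumptions: the helper \texttt{softmax\_policy} and the toy values of \(\phi(x)\) and \(\mu_0\) are hypothetical and only serve to show how the inverse temperature reshapes the action distribution.

```python
import math


def softmax_policy(phi_x, mu, eta=1.0):
    """Return pi(a|x) proportional to exp(eta * phi(x)^T mu_a) for every action a."""
    logits = [eta * sum(p * m for p, m in zip(phi_x, mu_a)) for mu_a in mu]
    shift = max(logits)  # subtract the max logit for numerical stability
    weights = [math.exp(l - shift) for l in logits]
    total = sum(weights)
    return [w / total for w in weights]


# Toy feature vector and per-action parameters (illustrative values only).
phi_x = [1.0, -0.5]
mu0 = [[2.0, 0.0], [0.0, 1.0], [-1.0, 0.5]]

uniform = softmax_policy(phi_x, mu0, eta=0.0)    # eta_0 = 0: uniform policy
sharp = softmax_policy(phi_x, mu0, eta=1.0)      # eta_0 > 0: favors the action preferred by mu_0
inverted = softmax_policy(phi_x, mu0, eta=-0.5)  # eta_0 < 0: disfavors that action
```

With \(\eta_0 = 0\) all logits vanish and every action receives probability \(1/K\); a negative \(\eta_0\) reverses the ranking induced by \(\mu_0\), which is why such logging policies can perform worse than the uniform one.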

We consider two main experiments. In \cref{subsec:fixed_iw}, we focus on a common IW regularization technique, specifically \texttt{Clip} in \eqref{eq:regs}. We then apply PAC-Bayesian learning principles from the literature that were specifically designed for \texttt{Clip} and compare them with ours applied to \texttt{Clip}. The goal is to demonstrate that our learning principle not only has broader applicability but also outperforms existing ones. After validating the improved performance of our PAC-Bayesian learning principle, we proceed in \cref{subsec:fixed_lp} to compare existing IW regularizations by training policies using our learning principles applied to them. The goal of these experiments is to determine whether there is a particular IW regularization technique that yields improved performance in OPL.


\subsection{Comparing Learning Principles Under a Common IW Regularization}\label{subsec:fixed_iw}
Here, we focus on the impact of different pessimistic learning principles on the performance of the learned policy given a fixed IW regularization method, specifically \texttt{Clip} as defined in \eqref{eq:regs}. Recall that \texttt{Clip} regularizes the IW as \(\hat{w}(x, a) = \frac{\pi(a|x)}{\max(\pi_0(a|x), \tau)}\), with \(\tau\) set to \(1/\sqrt[4]{n}\) following the suggestion in \citet{ionides2008truncated}. To ensure a fair comparison, we consider PAC-Bayesian learning principles from the literature where the theoretical bound was optimized. Specifically, we include the PAC-Bayesian bounds proposed for \texttt{Clip} in two prior works, \citet{london2019bayesian} and \citet{sakhi2022pac}. We label the baselines as \emph{London et al.} for optimizing the bound from \citet[Theorem 1]{london2019bayesian}, and for \citet{sakhi2022pac}, we distinguish their two bounds as \emph{Sakhi et al. 1} (from \citet[Proposition 1]{sakhi2022pac}, based on \citet{catoni2007pac}) and \emph{Sakhi et al. 2} (from \citet[Proposition 3]{sakhi2022pac}, a Bernstein-type bound). Since both \citet{london2019bayesian} and \citet{sakhi2022pac} used the linear IW regularization trick described in \cref{sec:main_result}, we compare their methods with optimizing our bound in \cref{corr:lin_reg_main}, a direct consequence of \cref{thm:main_result} when the IW regularization is linear in \(\pi\). Since we use linear IW regularizations, we optimize over Gaussian policies as described in \citet{aouali23a} and briefly discussed in \cref{app:linear_reg}, as these are known to perform better in these scenarios \citep{sakhi2022pac,aouali23a}. Finally, we also include the logging policy as a baseline.
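
The \texttt{Clip} weights recalled above can be plugged into a simple importance-weighted estimate, sketched below. The function \texttt{clipped\_ips\_reward}, the logged-tuple layout \((x, a, r, p_0)\), and the toy data are assumptions for illustration; the estimate is written in terms of reward, so it would be negated to obtain a risk.

```python
def clipped_ips_reward(logged, target_policy, tau=None):
    """Clipped IPS estimate of the reward of `target_policy`.

    `logged` is a list of tuples (x, a, r, p0), where p0 = pi_0(a|x).
    If `tau` is None, it defaults to n ** (-1/4), the choice quoted in the text.
    """
    n = len(logged)
    if tau is None:
        tau = n ** -0.25
    total = 0.0
    for x, a, r, p0 in logged:
        # Clip regularization: bound the denominator from below by tau.
        w_hat = target_policy(x)[a] / max(p0, tau)
        total += w_hat * r
    return total / n


def target_policy(x):
    """Hypothetical two-action target policy, uniform for every context."""
    return [0.5, 0.5]


# 16 toy samples: action 0 logged with propensity 0.1, action 1 with 0.9.
logged = [(None, 0, 1.0, 0.1)] * 8 + [(None, 1, 0.0, 0.9)] * 8
clipped = clipped_ips_reward(logged, target_policy)             # tau = 16 ** -0.25 = 0.5
unclipped = clipped_ips_reward(logged, target_policy, tau=0.0)  # plain IPS
```

On this toy data, the small propensity \(0.1\) is replaced by \(\tau = 0.5\), so each of those weights is capped at \(1\) instead of \(5\); this trades variance for bias.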


In \cref{fig:sota}, we plot the reward of the learned policy as a function of the quality (i.e., performance) of the logging policy, controlled by \(\eta_0 \in [0, 1]\), for policies optimized with each of the pessimistic learning principles above. Our method outperforms all baselines across a wide range of logging policies. Thus, in addition to being generic and applicable to a large family of IW regularizers, our approach proves to be more effective than objectives tailored for specific IW regularizations. This improvement holds when \(\eta_0\) is not very close to zero, which is the more realistic scenario in practice, where the logging policy typically outperforms a uniform policy. Additionally, note that the policy learned with any method (including ours) improves upon the logging policy (indicated by dashed black lines).

Finally, we also conducted an experiment comparing our learning principle, \textbf{Heuristic Optimization}, with the \(L_2\) heuristic from \citet{london2019bayesian}. We found that both heuristics had identical performance (\cref{app:heuristic_comp}).



\begin{figure}
  \centering  \includegraphics[width=0.5\textwidth]{figures/comparison_sota.pdf}
  \caption{Performance of the learned policy with different PAC-Bayes pessimistic learning principles (our \cref{corr:lin_reg_main} and those in \citet{london2019bayesian,sakhi2022pac}) using the \texttt{Clip} IPS risk estimator in \eqref{eq:regs}.}
  \label{fig:sota}
\end{figure}


\begin{figure*}[ht]
  \centering  \includegraphics[width=\linewidth]{figures/results_bound_optimization.pdf}
  \caption{Performance of the policy learned by \textbf{Bound Optimization} \eqref{eq:objective_pac_bayes} for different IW regularizations. The \(x\)-axis reflects the quality of the logging policy \(\eta_0 \in [-0.5, 0.5]\). In the first four columns, we plot the reward of the learned policy using a fixed IW regularization technique (\texttt{Clip}, \texttt{Har}, \texttt{IX}, or \texttt{ES} as defined in \eqref{eq:regs}) for various values of its hyperparameter within \([0,1]\). In the last column, we report the mean reward across these hyperparameter values.} 
  \label{fig:main_exp_results}
\end{figure*}

\begin{figure*}[ht]
  \centering  \includegraphics[width=\linewidth]{figures/results_heuristic_optimization.pdf}
  \caption{Performance of the policy learned by \textbf{Heuristic Optimization} \eqref{eq:learning_principle} for different IW regularizations. The \(x\)-axis reflects the quality of the logging policy \(\eta_0 \in [-0.5, 0.5]\). In the first four columns, we plot the reward of the learned policy using a fixed IW regularization technique (\texttt{Clip}, \texttt{Har}, \texttt{IX}, or \texttt{ES} as defined in \eqref{eq:regs}) for various values of its hyperparameter within \([0,1]\). In the last column, we report the mean reward across these hyperparameter values.
  } 
  \label{fig:main_exp_results2}
\end{figure*}


\subsection{Comparing IW Regularizations Under a Common Learning Principle}\label{subsec:fixed_lp}
After demonstrating the favorable performance of our approach in the previous section, we now evaluate its performance with different IW regularization techniques. Specifically, we consider \texttt{Clip}, \texttt{Har}, \texttt{IX}, and \texttt{ES} as defined in \eqref{eq:regs}. We employ both learning principles: one that directly optimizes the theoretical bound and another that optimizes the heuristic derived from it. For the bound optimization, we cannot use \cref{corr:lin_reg_main} since it includes a non-linear IW regularization (\texttt{Har}). Instead, we optimize the bound in \cref{thm:main_result} as explained in the \textbf{Bound Optimization} paragraph in \cref{subsec:lp}. For the heuristic optimization, we use the method described in the \textbf{Heuristic Optimization} paragraph in \cref{subsec:lp}. In this context, we optimize over softmax policies defined as
\begin{align}
    \pi^{\textsc{sof}}_{\theta}(a | x) &= \frac{\exp(\phi(x)^\top \theta_a)}{\sum_{a^\prime \in \cA}\exp(\phi(x)^\top  \theta_{a^\prime})}\,,
\end{align}
where the parameters \(\theta\) are learned using either \textbf{Bound Optimization} \eqref{eq:objective_pac_bayes} or \textbf{Heuristic Optimization} \eqref{eq:learning_principle}. \textbf{Bound Optimization} involves a single hyperparameter, \(\lambda\), as defined in \cref{thm:main_result}. We set \(\lambda\) to its optimal value, \(\lambda_*\), which minimizes the bound with respect to \(\lambda\). Strictly speaking, our theory does not cover this choice, since \cref{thm:main_result} requires \(\lambda\) to be fixed in advance, whereas \(\lambda_*\) is data-dependent. However, we found this method to yield good empirical performance. On the other hand, \textbf{Heuristic Optimization} relies on three hyperparameters, \(\lambda_1\), \(\lambda_2\), and \(\lambda_3\), which we set to \(\lambda_1 = \lambda_2 = \lambda_3 = 10^{-5}\).


In \Cref{fig:main_exp_results,fig:main_exp_results2}, we present the rewards of the learned policies using different IW regularizations as a function of the quality of the logging policy \(\pi_0\), based on the two proposed learning principles: \textbf{Bound Optimization} in \Cref{fig:main_exp_results} and \textbf{Heuristic Optimization} in \Cref{fig:main_exp_results2}. In both figures, the first and second rows correspond to results on \texttt{MNIST} and \texttt{FashionMNIST}, respectively. In the first four columns, we plot the reward of the learned policy using a fixed IW regularization technique (\texttt{Clip}, \texttt{Har}, \texttt{IX}, or \texttt{ES} as defined in \eqref{eq:regs}) for various values of its hyperparameter within \([0,1]\). In the last column, we report the mean reward across these hyperparameter values to assess the sensitivity of the IW regularization technique to its hyperparameter. The \(x\)-axis represents \(\eta_0\), which controls the quality of the logging policy; higher values indicate better performance of the logging policy. We vary \(\eta_0 \in [-0.5, 0.5]\) to consider logging policies that perform worse than the uniform one (i.e., when \(\eta_0 < 0\)), to highlight settings that might require more IW regularization, although such scenarios may not be realistic.
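
For intuition on how the four regularizations reshape a large importance weight, the sketch below evaluates each of them on one (policy, propensity) pair. The functional forms are assumptions based on how these corrections are commonly written in the literature, and \eqref{eq:regs} remains the authoritative definition; the function names and the toy values are hypothetical.

```python
# Assumed forms of the regularized importance weights (see eq:regs for the
# authoritative definitions). `pi` = pi(a|x), `pi0` = pi_0(a|x).

def clip_iw(pi, pi0, tau):
    """Clip: bound the denominator from below by tau."""
    return pi / max(pi0, tau)


def har_iw(pi, pi0, lam):
    """Har (assumed harmonic form): non-linear in pi."""
    return pi / ((1.0 - lam) * pi0 + lam * pi)


def ix_iw(pi, pi0, gamma):
    """IX (assumed implicit-exploration form): add gamma to the denominator."""
    return pi / (pi0 + gamma)


def es_iw(pi, pi0, alpha):
    """ES (assumed exponential-smoothing form): raise pi0 to the power alpha."""
    return pi / pi0 ** alpha


pi, pi0 = 0.5, 0.01  # toy values giving a large raw weight pi / pi0
plain = pi / pi0
regularized = {
    "Clip": clip_iw(pi, pi0, tau=0.1),
    "Har": har_iw(pi, pi0, lam=0.5),
    "IX": ix_iw(pi, pi0, gamma=0.1),
    "ES": es_iw(pi, pi0, alpha=0.5),
}
```

Under these assumed forms, setting \(\tau = 0\), \(\lambda = 0\), \(\gamma = 0\), or \(\alpha = 1\) recovers the unregularized weight \(\pi(a|x)/\pi_0(a|x)\), while the values used above all shrink it.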






Our results lead to the following conclusions. In \Cref{fig:main_exp_results}, we observe that all regularizations result in improved performance over the logging policy (i.e., all lines are above the dashed line representing the performance of the logging policy), with the \texttt{Har} regularization showing less improvement. Overall, \texttt{Clip}, \texttt{IX}, and \texttt{ES} achieve comparable performances, as summarized in the far-right column, despite regularizing IWs in very different ways. On the one hand, these results align with the generality of our bound, which applies to all these IW regularizations. On the other hand, they suggest that one can choose any IW regularization method when learning the policy by optimizing the theoretical bound without risking underperformance.


These results and conclusions are further confirmed by the rewards reported in \Cref{fig:main_exp_results2}, where the policies are learned through \textbf{Heuristic Optimization} \eqref{eq:learning_principle}. The performances are even better than those obtained when optimizing the theoretical bound. As discussed in \cref{subsec:lp}, this may be due to the practical optimization of the theoretical bound, where we used Monte Carlo to estimate the expectations, which performs poorly in high-dimensional problems. However, optimizing the heuristic or theoretical bound leads to similar performance when the IW regularizers are linear (\cref{app:linear_reg}) since, in that case, the Monte Carlo estimation was improved. Moreover, as summarized in the far-right column of \Cref{fig:main_exp_results2}, the average performances for all regularizations are comparable, except for \texttt{Har}, which is below the others, and \texttt{ES}, which performs slightly better. Notably, for all regularizations, there is at least one choice of regularization hyperparameter that achieves optimal performance. This finding diverges from \citet{aouali23a}, who attributed significant performance improvements to \texttt{ES} in a similar setting to ours. Our results clarify that these gains may be due to their newly introduced pessimistic learning principle rather than their smooth IW regularization (\texttt{ES}).
