%\section*{ORGANIZATION OF THE SUPPLEMENTARY MATERIAL}
The supplementary material is organized as follows. 
\begin{itemize}
    \item \textbf{\cref{sec:related_work}} includes an extended related work discussion.
    \item \textbf{\cref{proofs:opl}} outlines the proofs of our main results, as well as additional discussions.
    \item \textbf{\cref{sec:experiments_details}} presents our experimental setup for reproducibility, along with supplementary experiments.
\end{itemize}

\section{EXTENDED RELATED WORK}\label{sec:related_work}
The framework of contextual bandits is a widely adopted model for online learning in uncertain environments \citep{lattimore19bandit,auer02finitetime,thompson33likelihood,russo18tutorial,li10contextual,chu11contextual}. This framework naturally aligns with the online learning paradigm, which seeks to adapt in real time. However, practical challenges arise when the action space is large. While numerous online algorithms have emerged to efficiently navigate large action spaces in contextual bandits \citep{zong2016cascading,hong2022deep,zhu2022contextual,aoualimixed}, a notable need remains for offline methods that optimize decision-making from historical data. Fortunately, we often possess large datasets summarizing historical interactions with contextual bandit environments. Leveraging these, agents can improve their policies offline \citep{swaminathan2015batch,london2019bayesian,sakhi2022pac,aouali23a}. This study is primarily dedicated to this offline mode of contextual bandits, often referred to as the \textit{off-policy} formulation \citep{dudik2011doubly,dudik2012sample,dudik14doubly,wang2017optimal,farajtabar2018more}. Off-policy contextual bandits entail two primary tasks. The first, known as \textit{off-policy evaluation (OPE)}, estimates policy performance from historical data, mimicking the evaluation that would be obtained if the policy were interacting with the environment in real time. The resulting estimator is then optimized to find the optimal policy, which is called \textit{off-policy learning (OPL)} \citep{swaminathan2015batch}. Next, we review both OPE and OPL.





%A contextual bandit \citep{lattimore19bandit} is a popular and practical framework for online learning to act under uncertainty \citep{li10contextual,chu11contextual}. In practice, the action space is large and short-term gains are important. Thus the agent should be \emph{risk-averse} which goes against the core principle of online algorithms that seek to explore the action space for the sake of long-term gains \citep{auer02finitetime,thompson33likelihood,russo18tutorial}. Although some practical algorithms have been proposed to efficiently explore the action space of a contextual bandit \citep{zong2016cascading,hong2022deep,zhu2022contextual,aoualimixed}. A clear need remains for an offline procedure that allows optimizing decision-making using offline data. Fortunately, we have access to logged data about previous interactions. The agent can leverage such data to learn an improved policy \emph{offline} \citep{swaminathan2015batch,london2019bayesian,sakhi2022pac} and consequently enhance the performance of the current system. In this work, we are concerned with this offline, or \emph{off-policy}, formulation of contextual bandits \citep{dudik2011doubly,dudik2012sample,dudik14doubly,wang2017optimal,farajtabar2018more}. Before learning an improved policy, an important intermediary step is to estimate the performance of policies using logged data, as if they were evaluated online. This task is referred to as \emph{off-policy evaluation (OPE)} \citep{dudik2011doubly}. After that, the resulting estimator is optimized to approximate the optimal policy, and this is referred to as \emph{off-policy learning (OPL)} \citep{swaminathan2015batch}. Next, we review both OPE and OPL approaches.


\subsection{Off-Policy Evaluation}
\label{sec:related_work_OPE}

In recent years, OPE has experienced a noticeable surge of interest, with numerous significant contributions \citep{dudik2011doubly,dudik2012sample,dudik14doubly,wang2017optimal,farajtabar2018more,su2019cab,su2020doubly,kallus2021optimal,metelli2021subgaussian,kuzborskij2021confident,saito2022off,sakhi2020blob,jeunen2021pessimistic,saito2023off}. The literature on OPE can be broadly classified into three primary approaches. The first, referred to as the direct method (DM) \citep{jeunen2021pessimistic,aouali2024bayesian}, learns a model that approximates the expected cost of any context-action pair. This model is subsequently used to estimate the performance of policies. This approach is often used in large-scale recommender systems \citep{sakhi2020blob,jeunen2021pessimistic,aouali2022probabilistic,aouali2022reward}. The second approach, known as inverse propensity scoring (IPS) \citep{horvitz1952generalization,dudik2012sample}, estimates the cost of the evaluated policies by correcting for the preference bias of the logging policy in the logged data. While IPS is unbiased when the evaluation policy is absolutely continuous with respect to the logging policy, it can suffer from high variance, and from substantial bias when this assumption is violated \citep{sachdeva2020off}. In response to the variance issue, various techniques have been introduced, including the clipping of importance weights \citep{ionides2008truncated,swaminathan2015batch}, their smoothing \citep{aouali23a}, and self-normalization \citep{swaminathan2015self}, among others \citep{gilotte2018offline}. The third approach, known as doubly robust (DR) \citep{robins1995semiparametric,bang2005doubly,dudik2011doubly,dudik14doubly,farajtabar2018more}, combines elements of both DM and IPS. 
This combination serves to reduce the variance of the estimation. Typically, the accuracy of an OPE estimator $\hat{R}(\pi, S)$ is assessed using the mean squared error (MSE). It is worth mentioning that \citet{metelli2021subgaussian} advocate high-probability concentration rates as the preferred metric for evaluating OPE estimators. This work focuses primarily on OPL, and hence we did not evaluate the regularized IPS on OPE.




\subsection{Off-Policy Learning}
\label{sec:opl_learning_principles}
Prior OPL research has primarily focused on deriving learning principles rooted in generalization bounds \emph{under the clipped IPS} estimator. First, \citet{swaminathan2015batch} designed a learning principle that favors policies with both low estimated cost and low empirical variance. \citet{faury2020distributionally} extended this concept by incorporating distributionally robust optimization, while \citet{zenati2020counterfactual} adapted it to continuous action spaces. The latter also proposed a softer importance weight regularization that differs from clipping. \citet{london2019bayesian} established a connection between PAC-Bayes theory and OPL, which led to a novel PAC-Bayes generalization bound for the clipped IPS. Once again, this bound served as the foundation for a learning principle, one that promotes policies with low estimated cost and parameters that are close to those of the logging policy in terms of $L_2$ distance. More recently, \citet{sakhi2022pac} introduced new generalization bounds tailored to the clipped IPS. A notable feature of their approach is the direct optimization of the theoretical bound, rendering the use of learning principles unnecessary. Expanding upon these advancements, \citet{aouali23a} presented a generalization bound designed for IPS with exponential smoothing. What distinguishes this bound from previous ones is that it also applies to standard IPS, without assuming that the importance weights are bounded. They also demonstrated that optimizing this bound for IPS with exponential smoothing results in superior performance compared to optimizing existing bounds for clipped IPS. 
However, a significant question lingers: is the performance improvement primarily attributed to the enhanced regularization offered by exponential smoothing over clipping, or is it the consequence of the novel bound itself? This uncertainty arises from the fact that the bounds employed for clipping and exponential smoothing differ. In light of these considerations, our work introduces a unified set of generalization bounds, allowing for meaningful comparisons that address this question and contribute to a deeper understanding of the performance dynamics.





 





%Also, \citet{sakhi2022pac} introduced novel generalization bounds tailored the clipped IPS. What sets their approach apart is the direct optimization of the theoretical bound, eliminating the need for the use of learning principles. Building upon all these developments, \citet{aouali23a} proposed a generalized bound for IPS with exponential smooting. What sets this bound apart from all previous ones is that it holds in particular for standard IPS without making the assumption that the importance weights are bounded. Then, they showed that optimizing this bound under ISP with exponential smoothing leads to better performance than optimizing existing bounds under clipped ISP. However, an important question remains, was the increase in performance caused by the better regularization of exponential smooting compared to clippoing or because of the new bound since the bounds used for clipping and exponential smootinh were not the same. Motivated by this, our work proposes a unified generalization bounds allowing such comparisons.


%and they \textbf{TODO.} finish this

%However, there is a limitation to their method. The bounds derived by \citet{sakhi2022pac} still feature a multiplicative dependency on $1/\tau$, rendering them unsuitable for situations where $\tau$ is small and making them inapplicable to standard IPS settings.

%Moreover, our work extends beyond the existing body of research by providing two-sided generalization bounds. This is a notable departure from previous works that primarily derived one-sided generalization bounds. It's important to note that one-sided bounds fail to offer any guarantees regarding the expected performance of the learned policy.





%Additionally, our contributions extend to proposing a distinct estimator that deviates from the conventional clipped IPS commonly used in the context of off-policy learning (OPL). Empirical evidence suggests that our proposed estimator outperforms the traditional approach, showcasing its superiority in practical applications.



%Our current study improves upon these existing approaches in several key ways. First, the dependency on $M$ present in \eqref{eq:svp_bound} renders it inapplicable to standard IPS when $M$ approaches infinity. In contrast, our bound, as presented in \cref{thm:main_result}, does not exhibit a similar dependency on $\alpha$, thus providing generalization guarantees for standard IPS without necessitating the assumption of bounded importance weights. Secondly, the complexity measure $\mathcal{C}_n(\Pi, \delta)$ is often challenging to compute, while our bound is tractable. The KL terms can also be computed or bounded in closed-form for Gaussian and mixed-logit policies. Third, our bound is both differentiable and scalable, whereas the learning principle in \eqref{eq:svp_principle} necessitates additional optimization considerations \citep{swaminathan2015batch}. Fourth, tuning $\lambda$ within the framework of \eqref{eq:svp_principle} poses a considerable challenge when aligned with online metrics. Finally, we adopt a theoretically grounded approach that involves the direct optimization of our bound, eliminating the need for additional hyper-parameter tuning.


% However, the bound introduced by \citet{london2019bayesian} is characterized by a multiplicative dependency on $1/\tau$, rendering it inapplicable to standard IPS when $\tau=0$. Moreover, it is ill-suited for stochastic first-order optimization \citep{robbins1951stochastic} due to the presence of data-dependent quantities within a square root. Additionally, direct optimization of this bound led to minimal practical improvements over the logging policy. As a result, \citet{london2019bayesian} opted for the use of the learning principle in \eqref{eq:l2_principle}, which, while suffering from similar issues as discussed earlier for \citet{swaminathan2015batch}, is scalable. 

\section{MISSING PROOFS AND RESULTS}\label{proofs:opl}
In this section, we prove \cref{thm:main_result}.

\begin{theorem}[\cref{thm:main_result} Restated]
 Let $\lambda>0$,  $n \ge 1$, $\delta \in (0, 1)$, and let $\mathbb{P}$ be a fixed prior on $\Theta$. Then
the following inequality holds with probability at least $1-\delta$ for any distribution $\mathbb{Q}$ on $\Theta$
\begin{align}
    \left|\E{\theta \sim \mathbb{Q}}{R(\pi_{\theta})-\hat{R}(\pi_{\theta}, S)}\right|   \le \sqrt{ \frac{{\textsc{kl}}_{1}(\mathbb{Q})}{2n} } + B_n(\mathbb{Q})  +
\frac{{\textsc{kl}}_{2}(\mathbb{Q})}{n \lambda } + \frac{\lambda}{2}\bar{V}_n(\mathbb{Q})\,,
\end{align}
where ${\textsc{kl}}_{1}(\mathbb{Q})=D_{\mathrm{KL}}(\mathbb{Q} \| \mathbb{P})+\log \frac{4\sqrt{n}}{\delta}$, ${\textsc{kl}}_{2}(\mathbb{Q})=D_{\mathrm{KL}}(\mathbb{Q} \| \mathbb{P})+\log (4 / \delta)$, and 
\begin{align*}
    &\bar{V}_n(\mathbb{Q}) = \frac{1}{n}\sum_{i=1}^n  \mathbb{E}_{\theta \sim \mathbb{Q}}\left[\E{a \sim \pi_0(\cdot | x_i)}{\hat{w}_\theta(x_i, a)^2}+\hat{w}_\theta(x_i, a_i)^2 c_i^2\right]\,,\\
    &B_n(\mathbb{Q}) = \frac{1}{n} \sum_{i=1}^{n} \sum_{a \in \cA} \mathbb{E}_{\theta \sim \mathbb{Q}}\left[\left|\pi_{\theta}(a | x_i) -\pi_0(a | x_i)\hat{w}_\theta(x_i, a)\right|\right]\,.
\end{align*}
\end{theorem}
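
To illustrate the role of $\lambda$ in the bound, the following sketch treats $\mathbb{Q}$ and the logged data as fixed and minimizes the last two terms over $\lambda$. This is only a heuristic reading: the theorem holds for a value of $\lambda$ fixed in advance, so a data-dependent choice would require an additional argument (for instance, a union bound over a grid of candidate values), which is not carried out here. With the shorthand $a = {\textsc{kl}}_{2}(\mathbb{Q})/n$ and $b = \bar{V}_n(\mathbb{Q})/2$, introduced only for this illustration, and assuming $\bar{V}_n(\mathbb{Q}) > 0$, we have
\begin{align*}
    \min_{\lambda > 0} \left\{\frac{a}{\lambda} + b \lambda\right\} = 2\sqrt{a b} = \sqrt{\frac{2\, {\textsc{kl}}_{2}(\mathbb{Q})\, \bar{V}_n(\mathbb{Q})}{n}}\,, \qquad \text{attained at } \lambda^\star = \sqrt{\frac{a}{b}} = \sqrt{\frac{2\, {\textsc{kl}}_{2}(\mathbb{Q})}{n \bar{V}_n(\mathbb{Q})}}\,.
\end{align*}
Under this idealized choice, the last two terms scale as $1/\sqrt{n}$ whenever ${\textsc{kl}}_{2}(\mathbb{Q})$ and $\bar{V}_n(\mathbb{Q})$ remain bounded, which matches the rate of the first term up to the logarithmic factor in ${\textsc{kl}}_{1}(\mathbb{Q})$.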

\begin{proof}
First, we decompose the difference $\E{\theta \sim \mathbb{Q}}{R(\pi_{\theta})-\hat{R}(\pi_{\theta}, S)}$ as
\begin{align*}
    \E{\theta \sim \mathbb{Q}}{R(\pi_{\theta})-\hat{R}(\pi_{\theta}, S)} =  \underbrace{\E{\theta \sim \mathbb{Q}}{R(\pi_{\theta}) - \frac{1}{n}\sum_{i=1}^n R(\pi_{\theta} | x_i)}}_{I_1} + \underbrace{\frac{1}{n} \sum_{i=1}^n \E{\theta \sim \mathbb{Q}}{R(\pi_{\theta} | x_i) - \hat{R}(\pi_{\theta} | x_i)}}_{I_2}\\ + \underbrace{\frac{1}{n}\sum_{i=1}^n \E{\theta \sim \mathbb{Q}}{\hat{R}(\pi_{\theta} | x_i)} - \E{\theta \sim \mathbb{Q}}{\hat{R}(\pi_{\theta}, S)}}_{I_3}\,,
\end{align*}
where 
\begin{align*}
    & R(\pi_{\theta}) = \E{x \sim \nu\,, a \sim \pi_{\theta}(\cdot | x)}{c(x, a)}\,,\\
    & R(\pi_{\theta} | x_i)    = \E{a \sim \pi_{\theta}(\cdot | x_i)}{c(x_i, a)}\,,\\ 
   & \hat{R}(\pi_{\theta} | x_i) = \E{a \sim \pi_0(\cdot | x_i)}{\hat{w}_\theta(x_i, a)c(x_i, a)}\,, \\
   & \hat{R}(\pi_{\theta}, S) = \frac{1}{n} \sum_{i=1}^n \hat{w}_\theta(x_i, a_i)c_i \,,
\end{align*}
where $\hat{w}_\theta(x, a) = g(\pi_\theta(a|x), \pi_0(a|x))$ for some non-negative function $g$. Our goal is to bound $|\E{\theta \sim \mathbb{Q}}{R(\pi_{\theta})-\hat{R}(\pi_{\theta}, S)} |$. By the triangle inequality, it suffices to bound $|I_1| + |I_2| + |I_3| $. We start with $|I_1|$. \citet[Theorem 3.3]{alquier2021user} yields that the following inequality holds with probability at least $1 -\delta/2$ for any distribution $\mathbb{Q}$ on $\Theta$ 
\begin{align}\label{eq:app_i1}
    |I_1| &\leq \sqrt{ \frac{D_{\mathrm{KL}}(\mathbb{Q} \| \mathbb{P})+\log \frac{4\sqrt{n}}{\delta}}{2n}}\,,\nonumber\\
    &= \sqrt{ \frac{{\textsc{kl}}_{1}(\mathbb{Q})}{2n} }\,.
\end{align}
Moreover, $|I_2|$ can be bounded by decomposing it as 
\begin{align*}
    |I_2|&=\left|\E{\theta \sim \mathbb{Q}}{\frac{1}{n} \sum_{i=1}^{n}  \E{a \sim \pi_{\theta}(\cdot | x_i)}{c(x_i, a)}-\frac{1}{n} \sum_{i=1}^{n}\mathbb{E}_{a \sim \pi_0\left(\cdot | x_i\right)}\left[ \hat{w}_\theta(x_i, a)c(x_i, a) \right]}\right|\,, \\
    &=\left|\frac{1}{n} \sum_{i=1}^{n} \sum_{a \in \cA} \E{\theta \sim \mathbb{Q}}{\pi_{\theta}(a | x_i) c(x_i, a) -\pi_0(a | x_i)\hat{w}_\theta(x_i, a)c(x_i, a)}\right| \,,\\
    &\leq \frac{1}{n} \sum_{i=1}^{n} \sum_{a \in \cA}\E{\theta \sim \mathbb{Q}}{ \left|\pi_{\theta}(a | x_i) -\pi_0(a | x_i)\hat{w}_\theta(x_i, a)\right||c(x_i, a)|}\,.
\end{align*}
But $|c(x, a)| \leq 1$ for any $a \in \cA$ and $x \in \cX$. Thus 
\begin{align}\label{eq:app_i2}
   |I_2| &\leq \frac{1}{n} \sum_{i=1}^{n} \sum_{a \in \cA}\E{\theta \sim \mathbb{Q}}{ \left|\pi_{\theta}(a | x_i) -\pi_0(a | x_i)\hat{w}_\theta(x_i, a)\right|}\,,\nonumber\\
   & = B_n(\mathbb{Q})\,.
\end{align}
Finally, we need to bound the main term $|I_3|$. To achieve this, we borrow and adapt the statement of the following technical lemma \citep[Theorem 2.1]{haddouche2022pac} to our setting.

\begin{lemma}\label{lemma:app_maxime}
Let $\mathcal{Z}$ be an instance space and let $S_n=\left(z_i\right)_{i \in [n]}$ be an $n$-sized dataset for some $n \geq 1$. Let $\left(\mathcal{F}_i\right)_{i \in\{0\} \cup [n] }$ be a filtration adapted to $S_n$. Also, let $\Theta$ be a parameter space and let $\pi_\theta$, for $\theta \in \Theta$, be the corresponding policies. Now assume that $(f_i\left(S_i, \pi_\theta\right))_{i \in [n]}$ is a martingale difference sequence for any $\theta \in \Theta$, that is, for any $i \in [n]$ and $\theta \in \Theta\,,$ we have that $ \mathbb{E}\left[f_i\left(S_i, \pi_\theta\right) | \mathcal{F}_{i-1}\right]=0$. Moreover, for any $\theta \in \Theta$, let $M_n(\theta)=\sum_{i=1}^n f_i\left(S_i, \pi_\theta\right)$. Then for any fixed prior $\mathbb{P}$ on $\Theta$ and any $\lambda>0$, the following holds with probability at least $1-\delta$ over the sample $S_n$, simultaneously for any $\mathbb{Q}$ on $\Theta$
\begin{align*}
    \left|\E{\theta \sim \mathbb{Q}}{M_n(\theta)}\right| \leq \frac{D_{\mathrm{KL}}(\mathbb{Q} \| \mathbb{P})+\log (2 / \delta)}{\lambda}+\frac{\lambda}{2}\left(\E{\theta \sim \mathbb{Q}}{\langle M\rangle_n(\theta) + [M]_n(\theta)}\right)\,,
\end{align*}
where $ \langle M\rangle_n(\theta)=\sum_{i=1}^n \mathbb{E}\left[f_i\left(S_i, \pi_\theta\right)^2 | \mathcal{F}_{i-1}\right]$ and $[M]_n(\theta)=\sum_{i=1}^n f_i\left(S_i, \pi_\theta\right)^2$.
\end{lemma}

Recall that $\hat{w}_\theta(x, a) = g(\pi_\theta(a|x), \pi_0(a|x))$ for some non-negative function $g$. To apply \cref{lemma:app_maxime}, we need to construct an adequate martingale difference sequence $(f_i(S_i, \pi_\theta))_{i \in [n]}$ for $\theta \in \Theta$ that allows us to retrieve $|I_3|$. To achieve this, we define $S_n = (a_i)_{i \in [n]}$ as the set of $n$ taken actions. Also, we let $(\mathcal{F}_i)_{i \in \{0\}\cup[n]}$ be a filtration adapted to $S_n$. For $\theta \in \Theta$, we define $f_i\left(S_i, \pi_\theta\right)$ as 
\begin{align*}
    f_i\left(S_i, \pi_\theta\right) = f_i\left(a_i, \pi_\theta \right) = \E{a \sim \pi_0(\cdot | x_i)}{g(\pi_\theta(a | x_i), \pi_0(a | x_i))c(x_i, a)} - g(\pi_\theta(a_i | x_i), \pi_0(a_i | x_i))c(x_i, a_i)\,.
\end{align*}
We stress that $f_i(S_i, \pi_\theta)$ only depends on the last action in $S_i$, $a_i$, and the policy $\pi_\theta$. For this reason, we denote it by $f_i(a_i, \pi_\theta)$. The function $f_i$ is indexed by $i$ since it depends on the fixed $i$-th context, $x_i$. The context $x_i$ is fixed and thus randomness only comes from $a_i \sim \pi_0(\cdot | x_i)$. It follows that the expectations are under $a_i \sim \pi_0(\cdot | x_i)$. First, we have that $\mathbb{E}\left[f_i\left(a_i, \pi_\theta\right) | \mathcal{F}_{i-1}\right]= 0$ for any $i \in [n]\,,$ and $ \theta \in \Theta$. This follows from 
\begin{align*}
    &\mathbb{E}\left[f_i\left(a_i, \pi_\theta\right) | \mathcal{F}_{i-1}\right]  = \mathbb{E}_{a_i \sim \pi_0(\cdot | x_i)}\left[f_i\left(a_i, \pi_\theta\right) \Big| a_1, \ldots, a_{i-1}\right] \,,\\
     &=  \mathbb{E}_{a_i \sim \pi_0(\cdot | x_i)}\left[\E{a \sim \pi_0(\cdot | x_i)}{g(\pi_\theta(a | x_i), \pi_0(a | x_i))c(x_i, a)} - g(\pi_\theta(a_i | x_i), \pi_0(a_i | x_i))c(x_i, a_i) \Big| a_1, \ldots, a_{i-1}\right]\,,\\
     &\stackrel{(i)}{=}  \E{a \sim \pi_0(\cdot | x_i)}{g(\pi_\theta(a | x_i), \pi_0(a | x_i))c(x_i, a)} - \mathbb{E}_{a_i \sim \pi_0(\cdot | x_i)}\left[g(\pi_\theta(a_i | x_i), \pi_0(a_i | x_i))c(x_i, a_i) \Big| a_1, \ldots, a_{i-1}\right]\,.
\end{align*}
In $(i)$ we use the fact that given $x_i$, $ \E{a \sim \pi_0(\cdot | x_i)}{g(\pi_\theta(a | x_i), \pi_0(a | x_i))c(x_i, a)}$ is deterministic. Now $a_i$ does not depend on $a_1, \ldots, a_{i-1}$ since the logged data is i.i.d. Hence 
\begin{align*}
    \mathbb{E}_{a_i \sim \pi_0(\cdot | x_i)}\left[g(\pi_\theta(a_i | x_i), \pi_0(a_i | x_i))c(x_i, a_i) \Big| a_1, \ldots, a_{i-1}\right] &= \mathbb{E}_{a_i \sim \pi_0(\cdot | x_i)}\left[g(\pi_\theta(a_i | x_i), \pi_0(a_i | x_i))c(x_i, a_i)\right]\,,\\ &= \mathbb{E}_{a \sim \pi_0(\cdot | x_i)}\left[g(\pi_\theta(a | x_i), \pi_0(a | x_i))c(x_i, a)\right]\,.
\end{align*}
It follows that 
\begin{align*}
     \mathbb{E}[f_i&\left(a_i, \pi_\theta\right) | \mathcal{F}_{i-1}] 
     \\
     &=  \E{a \sim \pi_0(\cdot | x_i)}{g(\pi_\theta(a | x_i), \pi_0(a | x_i))c(x_i, a)} - \mathbb{E}_{a_i \sim \pi_0(\cdot | x_i)}\left[g(\pi_\theta(a_i | x_i), \pi_0(a_i | x_i))c(x_i, a_i) \Big| a_1, \ldots, a_{i-1}\right]\,,\\
    &= \E{a \sim \pi_0(\cdot | x_i)}{g(\pi_\theta(a | x_i), \pi_0(a | x_i))c(x_i, a)} - \mathbb{E}_{a \sim \pi_0(\cdot | x_i)}\left[g(\pi_\theta(a | x_i), \pi_0(a | x_i))c(x_i, a)\right]\,,\\
    &= 0\,.
\end{align*}
Therefore, for any $\theta \in \Theta$, $(f_i(a_i, \pi_\theta))_{i \in [n]}$ is a martingale difference sequence. Hence we apply \cref{lemma:app_maxime} and obtain that the following inequality holds with probability at least $1-\delta/2$ for any $\mathbb{Q}$ on $\Theta$
\begin{align}\label{eq:app_proof_0}
    \left|\E{\theta \sim \mathbb{Q}}{M_n(\theta)}\right| &\leq \frac{D_{\mathrm{KL}}(\mathbb{Q} \| \mathbb{P})+\log (4 / \delta)}{\lambda}+\frac{\lambda}{2}\left(\E{\theta \sim \mathbb{Q}}{\langle M\rangle_n(\theta) + [M]_n(\theta)}\right)\,,\nonumber\\
    &=\frac{{\textsc{kl}}_{2}(\mathbb{Q})}{ \lambda } 
+ \frac{\lambda}{2}\left(\E{\theta \sim \mathbb{Q}}{\langle M\rangle_n(\theta) + [M]_n(\theta)}\right)\,,
\end{align}
where 
\begin{align*}
    M_n(\theta)&=\sum_{i=1}^n f_i\left(a_i, \pi_\theta\right)\,,\\
     \langle M\rangle_n(\theta)&=\sum_{i=1}^n \mathbb{E}\left[f_i\left(a_i, \pi_\theta\right)^2 | \mathcal{F}_{i-1}\right]\,,\\
     [M]_n(\theta)&=\sum_{i=1}^n f_i\left(a_i, \pi_\theta\right)^2\,.
\end{align*}
Now these terms can be decomposed as 
\begin{align}\label{eq:app_proof_1}
    \E{\theta \sim \mathbb{Q}}{M_n(\theta)} &= \sum_{i=1}^n\E{\theta \sim \mathbb{Q}}{f_i\left(a_i, \pi_\theta\right)}\,,\nonumber\\
    &=  \sum_{i=1}^n \E{\theta \sim \mathbb{Q}}{\E{a \sim \pi_0(\cdot | x_i)}{g(\pi_\theta(a | x_i), \pi_0(a | x_i))c(x_i, a)} - g(\pi_\theta(a_i | x_i), \pi_0(a_i | x_i))c(x_i, a_i)}\,,\nonumber\\
    & \stackrel{(i)}{=} \sum_{i=1}^n \E{\theta \sim \mathbb{Q}}{\hat{R}(\pi_{\theta} | x_i)} - n\E{\theta \sim \mathbb{Q}}{\hat{R}(\pi_{\theta}, S)}\,,\nonumber\\
    &= nI_3\,,
\end{align}
where we used the fact that $c_i = c(x_i, a_i)$ for any $i \in [n]$ in $(i)$. 

Now we focus on the terms $\langle M\rangle_n(\theta)$ and $ [M]_n(\theta)$. First, we have that 
\begin{align}\label{eq:app_proof_2}
    f_i\left(a_i, \pi_\theta\right)^2 &= \Big(\E{a \sim \pi_0(\cdot | x_i)}{g(\pi_\theta(a | x_i), \pi_0(a | x_i))c(x_i, a)} - g(\pi_\theta(a_i | x_i), \pi_0(a_i | x_i))c(x_i, a_i)\Big)^2\,,\\
    &= \E{a \sim \pi_0(\cdot | x_i)}{g(\pi_\theta(a | x_i), \pi_0(a | x_i))c(x_i, a)}^2 + \Big(g(\pi_\theta(a_i | x_i), \pi_0(a_i | x_i))c(x_i, a_i)\Big)^2 \nonumber \\ & \hspace{2cm} - 2   \E{a \sim \pi_0(\cdot | x_i)}{g(\pi_\theta(a | x_i), \pi_0(a | x_i))c(x_i, a)} g(\pi_\theta(a_i | x_i), \pi_0(a_i | x_i))c(x_i, a_i)\,,\nonumber\\
    &= \E{a \sim \pi_0(\cdot | x_i)}{g(\pi_\theta(a | x_i), \pi_0(a | x_i))c(x_i, a)}^2 + g(\pi_\theta(a_i | x_i), \pi_0(a_i | x_i))^2c(x_i, a_i)^2 \nonumber \\ & \hspace{2cm} - 2   \E{a \sim \pi_0(\cdot | x_i)}{g(\pi_\theta(a | x_i), \pi_0(a | x_i))c(x_i, a)} g(\pi_\theta(a_i | x_i), \pi_0(a_i | x_i))c(x_i, a_i)\,.\nonumber
\end{align}
Moreover, $f_i\left(a_i, \pi_\theta\right)^2$ does not depend on $a_1, \ldots, a_{i-1}$. Thus
\begin{align*}
    \mathbb{E}\left[f_i\left(a_i, \pi_\theta\right)^2 | \mathcal{F}_{i-1}\right] = \mathbb{E}_{a_i \sim \pi_0(\cdot | x_i)}\left[f_i\left(a_i, \pi_\theta\right)^2 | \mathcal{F}_{i-1}\right] = \mathbb{E}_{a_i \sim \pi_0(\cdot | x_i)}\left[f_i\left(a_i, \pi_\theta\right)^2\right]= \mathbb{E}_{a \sim \pi_0(\cdot | x_i)}\left[f_i\left(a, \pi_\theta\right)^2\right]\,.
\end{align*}
Computing $\mathbb{E}_{a \sim \pi_0(\cdot | x_i)}\left[f_i\left(a, \pi_\theta\right)^2\right]$ using the decomposition in \eqref{eq:app_proof_2} yields
\begin{align}\label{eq:app_proof_3}
    \mathbb{E}[f_i&\left(a_i, \pi_\theta\right)^2 | \mathcal{F}_{i-1}] = \mathbb{E}_{a \sim \pi_0(\cdot | x_i)}\left[f_i\left(a, \pi_\theta\right)^2\right] \,,\nonumber\\
    &= -  \E{a \sim \pi_0(\cdot | x_i)}{g(\pi_\theta(a | x_i), \pi_0(a | x_i))c(x_i, a)}^2
+ \E{a \sim \pi_0(\cdot | x_i)}{g(\pi_\theta(a | x_i), \pi_0(a | x_i))^2c(x_i, a)^2}\,.
\end{align}
Combining \eqref{eq:app_proof_2} and \eqref{eq:app_proof_3} leads to
\begin{align}\label{eq:app_proof_4}
    \mathbb{E}[f_i&\left(a_i, \pi_\theta\right)^2 | \mathcal{F}_{i-1}] + f_i\left(a_i, \pi_\theta\right)^2 = \E{a \sim \pi_0(\cdot | x_i)}{g(\pi_\theta(a | x_i), \pi_0(a | x_i))^2c(x_i, a)^2}+ g(\pi_\theta(a_i | x_i), \pi_0(a_i | x_i))^2c(x_i, a_i)^2 \nonumber\\ &\hspace{2cm} - 2   \E{a \sim \pi_0(\cdot | x_i)}{g(\pi_\theta(a | x_i), \pi_0(a | x_i))c(x_i, a)} g(\pi_\theta(a_i | x_i), \pi_0(a_i | x_i))c(x_i, a_i)\,,\nonumber\\
    & \stackrel{(i)}{\leq} \E{a \sim \pi_0(\cdot | x_i)}{g(\pi_\theta(a | x_i), \pi_0(a | x_i))^2c(x_i, a)^2}+ g(\pi_\theta(a_i | x_i), \pi_0(a_i | x_i))^2c(x_i, a_i)^2\,.
\end{align}
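The inequality in $(i)$ holds because $- 2   \E{a \sim \pi_0(\cdot | x_i)}{g(\pi_\theta(a | x_i), \pi_0(a | x_i))c(x_i, a)} g(\pi_\theta(a_i | x_i), \pi_0(a_i | x_i))c(x_i, a_i) \leq 0$. Indeed, since $g$ is non-negative and the costs have a constant sign (e.g., they are non-positive in \cref{alg:supervised_to_bandit}), the two factors $\E{a \sim \pi_0(\cdot | x_i)}{g(\pi_\theta(a | x_i), \pi_0(a | x_i))c(x_i, a)}$ and $g(\pi_\theta(a_i | x_i), \pi_0(a_i | x_i))c(x_i, a_i)$ have the same sign, so their product is non-negative. Therefore, we have that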
\begin{align*}
    \langle M\rangle_n(\theta) + [M]_n(\theta) \leq \sum_{i=1}^n \Big( \E{a \sim \pi_0(\cdot | x_i)}{g(\pi_\theta(a | x_i), \pi_0(a | x_i))^2c(x_i, a)^2}+ g(\pi_\theta(a_i | x_i), \pi_0(a_i | x_i))^2c(x_i, a_i)^2 \Big)\,.
\end{align*}
It follows that 
\begin{align}\label{eq:app_proof_5}
    &\E{\theta \sim \mathbb{Q}}{\langle M\rangle_n(\theta) + [M]_n(\theta)}\nonumber\\ & \leq \sum_{i=1}^n \Big( \E{\theta \sim \mathbb{Q}}{\E{a \sim \pi_0(\cdot | x_i)}{g(\pi_\theta(a | x_i), \pi_0(a | x_i))^2c(x_i, a)^2}}+ \E{\theta \sim \mathbb{Q}}{g(\pi_\theta(a_i | x_i), \pi_0(a_i | x_i))^2c(x_i, a_i)^2} \Big)\,.
\end{align}
Combining \eqref{eq:app_proof_0}, \eqref{eq:app_proof_1} and \eqref{eq:app_proof_5} yields
\begin{align}\label{eq:app_proof_6}
  n |I_3| &= \Big| \sum_{i=1}^n \E{\theta \sim \mathbb{Q}}{\hat{R}(\pi_{\theta} | x_i)} - n\E{\theta \sim \mathbb{Q}}{\hat{R}(\pi_{\theta}, S)}  \Big|\,, \nonumber\\
   &\leq \frac{{\textsc{kl}}_{2}(\mathbb{Q})}{ \lambda } + \frac{\lambda}{2}\sum_{i=1}^n \Big( \E{\theta \sim \mathbb{Q}}{\E{a \sim \pi_0(\cdot | x_i)}{g(\pi_\theta(a | x_i), \pi_0(a | x_i))^2c(x_i, a)^2}} \nonumber\\& \hspace{7cm} + \E{\theta \sim \mathbb{Q}}{g(\pi_\theta(a_i | x_i), \pi_0(a_i | x_i))^2c(x_i, a_i)^2} \Big)\,.
\end{align}
This means that the following inequality holds with probability at least $1-\delta/2$ for any distribution $\mathbb{Q}$ on $\Theta$
\begin{align}
   \left|I_3  \right| &\leq \frac{{\textsc{kl}}_{2}(\mathbb{Q})}{n \lambda } + \frac{\lambda}{2n}\sum_{i=1}^n  \E{\theta \sim \mathbb{Q}}{\E{a \sim \pi_0(\cdot | x_i)}{g(\pi_\theta(a | x_i), \pi_0(a | x_i))^2c(x_i, a)^2}}\nonumber\\& \hspace{7cm} + \frac{\lambda}{2n}\sum_{i=1}^n\E{\theta \sim \mathbb{Q}}{g(\pi_\theta(a_i | x_i), \pi_0(a_i | x_i))^2c(x_i, a_i)^2}\,.
\end{align}
Moreover, we know that $c(x, a)^2 \leq 1$ for any $x\in \cX$ and $a \in \cA$ and that $c(x_i, a_i) = c_i$ for any $i \in [n]$. Thus the following inequality holds with probability at least $1-\delta/2$ for any distribution $\mathbb{Q}$ on $\Theta$
\begin{align}\label{eq:app_i3}
   \left|I_3  \right| &\leq \frac{{\textsc{kl}}_{2}(\mathbb{Q})}{n \lambda } + \frac{\lambda}{2n}\sum_{i=1}^n  \E{\theta \sim \mathbb{Q}}{\E{a \sim \pi_0(\cdot | x_i)}{g(\pi_\theta(a | x_i), \pi_0(a | x_i))^2}}+ \frac{\lambda}{2n}\sum_{i=1}^n\E{\theta \sim \mathbb{Q}}{g(\pi_\theta(a_i | x_i), \pi_0(a_i | x_i))^2 c_i^2}\,,\nonumber\\
   &=\frac{{\textsc{kl}}_{2}(\mathbb{Q})}{n \lambda } + \frac{\lambda}{2}\bar{V}_n(\mathbb{Q})\,,
\end{align}
where we use that $g(\pi_\theta(a | x), \pi_0(a | x)) = \hat{w}_\theta(x, a)$. The union bound of \eqref{eq:app_i1} and \eqref{eq:app_i3} combined with the deterministic result in \eqref{eq:app_i2} yields that the following inequality holds with probability at least $1-\delta$ for any distribution $\mathbb{Q}$ on $\Theta$
\begin{align}\label{eq:app_main_inequality}
    &|\E{\theta \sim \mathbb{Q}}{R(\pi_{\theta})-\hat{R}(\pi_{\theta}, S)}|  \leq  \sqrt{ \frac{{\textsc{kl}}_{1}(\mathbb{Q})}{2n} } + B_n(\mathbb{Q})+\frac{{\textsc{kl}}_{2}(\mathbb{Q})}{n \lambda } + \frac{\lambda}{2}\bar{V}_n(\mathbb{Q})\,.
\end{align}
\end{proof}
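
As an illustration of how the choice of $g$ enters $B_n(\mathbb{Q})$, consider the summand $|\pi_{\theta}(a | x_i) -\pi_0(a | x_i)\hat{w}_\theta(x_i, a)|$ for three common regularizations. The forms of $g$ below are assumptions made for this example only (the usual definitions of standard IPS, clipping with a threshold $\tau \in (0, 1]$, and exponential smoothing with $\alpha \in [0, 1]$), and we further assume that $\pi_0(a | x_i) > 0$. Writing $p = \pi_{\theta}(a | x_i)$ and $q = \pi_0(a | x_i)$ as shorthand, we get
\begin{align*}
    \text{standard IPS:} \quad g(p, q) &= \frac{p}{q} \quad \Longrightarrow \quad |p - q\, g(p, q)| = 0\,,\\
    \text{clipping:} \quad g(p, q) &= \frac{p}{\max(q, \tau)} \quad \Longrightarrow \quad |p - q\, g(p, q)| = p \Big(1 - \frac{q}{\max(q, \tau)}\Big)\,,\\
    \text{exponential smoothing:} \quad g(p, q) &= \frac{p}{q^{\alpha}} \quad \Longrightarrow \quad |p - q\, g(p, q)| = p \big(1 - q^{1-\alpha}\big)\,.
\end{align*}
Hence $B_n(\mathbb{Q}) = 0$ for standard IPS, while for the two regularized estimators it is non-zero in general and vanishes once the regularization is inactive, that is, when $\tau \leq \pi_0(a | x_i)$ for all $a \in \cA$ and $i \in [n]$, or when $\alpha = 1$.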







\section{ADDITIONAL EXPERIMENTS}\label{sec:experiments_details}

\subsection{Detailed Setup}
We employ the well-established supervised-to-bandit conversion method described in \citet{agarwal2014taming}. Specifically, we work with two sets from a classification dataset: the training set, denoted by $\mathcal{S}^{\textsc{tr}}$, and the testing set, denoted by $\mathcal{S}^{\textsc{ts}}$. The first step transforms the training set $\mathcal{S}^{\textsc{tr}}$ into bandit logged data $S$, following the procedure outlined in \cref{alg:supervised_to_bandit}. The logged data $S$ is subsequently used to train our policies. The next phase assesses the learned policies on the testing set $\mathcal{S}^{\textsc{ts}}$, as outlined in \cref{alg:supervised_to_bandit_test}. We measure the performance of these policies using the reward obtained by running \cref{alg:supervised_to_bandit_test}. Higher rewards indicate better performance. In our experiments, we use two image classification datasets: \texttt{MNIST} \citep{lecun1998gradient} and \texttt{FashionMNIST} \citep{xiao2017fashion}.

The input to \cref{alg:supervised_to_bandit} is a logging policy denoted as $\pi_0$, defined as follows
\begin{align}\label{eq:logging}
&\pi_0(a | x) = \frac{ \exp(\eta_0  \phi(x)^\top \mu_{0, a})}{\sum_{a^\prime \in \cA} \exp(\eta_0  \phi(x)^\top \mu_{0, a^\prime})}\,, & \forall (x,a) \in \cX \times \cA\,.
\end{align}
Here, $\phi(x) \in \real^d$ is the feature transformation function, given by $\phi(x) = \frac{x}{\|x\|}$. The parameters $\mu_0 = (\mu_{0,a})_{a \in \cA} \in \real^{dK}$ are learned using a fraction (5\%) of the training set $\mathcal{S}^{\textsc{tr}}$, with the cross-entropy loss. Optimization is carried out using the Adam optimizer \citep{kingma2014adam}. The inverse-temperature parameter $\eta_0$ controls the performance of the logging policy. A high positive value of $\eta_0$ leads to a well-performing logging policy, while a negative value leads to a lower-performing one. In all experiments, the prior $\mathbb P$ is set as $\mathbb{P}= \cN(\eta_0 \mu_0, I_{dK})$.
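
To make the effect of $\eta_0$ in \eqref{eq:logging} concrete, the following limiting cases may help. They are a short illustration only, assuming $K = |\cA|$ actions and that the maximizer and minimizer of $a \mapsto \phi(x)^\top \mu_{0, a}$ are unique.
\begin{align*}
    \eta_0 = 0: &\quad \pi_0(a | x) = \frac{1}{K}\,,\\
    \eta_0 \to +\infty: &\quad \pi_0(a | x) \to \mathbb{I}_{\{a = \argmax_{a^\prime \in \cA} \phi(x)^\top \mu_{0, a^\prime}\}}\,,\\
    \eta_0 \to -\infty: &\quad \pi_0(a | x) \to \mathbb{I}_{\{a = \operatorname*{arg\,min}_{a^\prime \in \cA} \phi(x)^\top \mu_{0, a^\prime}\}}\,.
\end{align*}
That is, $\eta_0 = 0$ gives a uniform logging policy, a large positive $\eta_0$ concentrates on the action preferred by the learned classifier, and a negative $\eta_0$ favors the actions that the classifier scores lowest. Note also that the prior mean $\eta_0 \mu_0$ coincides with the parameters of $\pi_0$ under the softmax parameterization in \eqref{eq:app_softmax_pac_bayes}, so the prior is centered at the logging policy.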

\begin{algorithm}
\caption{Supervised-to-bandit: creating the logged data}
\label{alg:supervised_to_bandit}
\textbf{Input:} Classification training dataset $\mathcal{S}^{\textsc{tr}}=\{(x_i, y_i)\}_{i=1}^n$, logging policy $\pi_0$.\\
\textbf{Output:} Logged data $S=(x_i, a_i, c_i)_{i \in [n]}$.\\
Initialize $S =\{\}$ \\
\For{$i=1, \dots, n$}
{$a_i \sim \pi_0(\cdot | x_i)$\\
$c_i = - \mathbb{I}_{\{a_i = y_i\}}$\\
$S \gets S \cup \{(x_i, a_i, c_i)\}\,.$ 
}
\end{algorithm}


\begin{algorithm}
\caption{Supervised-to-bandit: testing policies}
\label{alg:supervised_to_bandit_test}
\textbf{Input:} Classification test dataset $\mathcal{S}^{\textsc{ts}}=\{(x_i, y_i)\}_{i=1}^{n_{\textsc{ts}}}$, learned policy $ \hat{\pi}_n$.\\
\textbf{Output:} Test reward $r$.\\
\For{$i=1, \dots, n_{\textsc{ts}}$}
{$a_i \sim \hat{\pi}_n(\cdot | x_i)$\\
$r_i = \mathbb{I}_{\{a_i = y_i\}}$}
$r = \frac{1}{n_{\textsc{ts}}} \sum_{i=1}^{n_{\textsc{ts}}} r_i\,.$
\end{algorithm}
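
Since actions are sampled in both algorithms, the logged costs and the test reward are random. As a short sanity check derived directly from \cref{alg:supervised_to_bandit} and \cref{alg:supervised_to_bandit_test}, taking expectations over the sampled actions, conditionally on the datasets, gives
\begin{align*}
\E{a_i \sim \pi_0(\cdot | x_i)}{c_i} &= - \pi_0(y_i | x_i)\,, & \E{a_i \sim \hat{\pi}_n(\cdot | x_i)}{r_i} &= \hat{\pi}_n(y_i | x_i)\,,
\end{align*}
so that the expected test reward equals $\frac{1}{n_{\textsc{ts}}} \sum_{i=1}^{n_{\textsc{ts}}} \hat{\pi}_n(y_i | x_i)$, the average probability that the learned policy assigns to the correct label.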

\subsection{Bound Optimization}\label{app:bound_opt}

In this section, the learned policy $\hat{\pi}_n$ is obtained by optimizing the following objective 
\begin{align}\label{eq:app_objective_pac_bayes}
   \operatorname*{arg\,min}_{\mathbb{Q}} \E{\theta \sim \mathbb{Q}}{\hat{R}(\pi_{\theta}, S)} + \sqrt{ \frac{{\textsc{kl}}_{1}(\mathbb{Q})}{2n} } + B_n(\mathbb{Q})  +
\frac{{\textsc{kl}}_{2}(\mathbb{Q})}{n \lambda } + \frac{\lambda}{2}\bar{V}_n(\mathbb{Q})\,,
\end{align}
where the quantities are defined in \cref{thm:main_result} and the learning policies $\pi_\theta$ are softmax policies of the form
\begin{align}\label{eq:app_softmax_pac_bayes}
    \pi^{\textsc{sof}}_{\theta}(a | x) &= \frac{\exp(\phi(x)^\top \theta_a)}{\sum_{a^\prime \in \cA}\exp(\phi(x)^\top  \theta_{a^\prime})}\,.
\end{align} 
To optimize \eqref{eq:app_objective_pac_bayes}, we employ the local reparameterization trick \citep{kingma2015variational}. Precisely, we set $\mathbb{Q} =  \mathcal{N}\left(\mu, \sigma^2 I_{dK}\right)$, where $\mu \in \real^{dK}$ and $\sigma>0$ are learnable parameters. Then, roughly speaking, the terms $\E{\theta \sim \mathbb{Q}}{\hat{R}(\pi_{\theta}, S)}$, $ B_n(\mathbb{Q})$ and $\bar{V}_n(\mathbb{Q})$ in \eqref{eq:app_objective_pac_bayes} are of the form $\E{\theta \sim \mathcal{N}\left(\mu, \sigma^2 I_{dK}\right)}{f(\pi^{\textsc{sof}}_{\theta}(a | x))}$ for some function $f$, and they can be rewritten as
\begin{align*}
  \E{\theta \sim \mathcal{N}\left(\mu, \sigma^2 I_{dK}\right)}{f(\pi^{\textsc{sof}}_{\theta}(a | x))}
  &= \E{\theta \sim \cN(\mu, \sigma^2 I_{dK})}{f\Big(\frac{\exp(\phi(x)^\top \theta_a)}{\sum_{a^\prime \in \cA}\exp(\phi(x)^\top  \theta_{a^\prime})}\Big)}\,,\\
  &= \E{\epsilon \sim \cN(0, I_{K})}{f\Big(\frac{\exp(\phi(x)^\top \mu_a + \sigma \epsilon_a)}{\sum_{a^\prime \in \cA}\exp(\phi(x)^\top  \mu_{a^\prime} + \sigma \epsilon_{a^\prime})}\Big)}\,,
\end{align*}
where we use in the second equality the fact that $\|\phi(x)\|_2=1$ in our experiments since we normalized features. Then the above expectation can be approximated as 
\begin{align*}
  \E{\theta \sim \mathcal{N}\left(\mu, \sigma^2 I_{dK}\right)}{f(\pi^{\textsc{sof}}_{\theta}(a | x))}
  & \approx  \frac{1}{S} \sum_{i \in [S]}{f\Big(\frac{\exp(\phi(x)^\top \mu_a + \sigma \epsilon_{i, a})}{\sum_{a^\prime \in \cA}\exp(\phi(x)^\top  \mu_{a^\prime} + \sigma \epsilon_{i, a^\prime})}\Big)}\,, & 
\epsilon_i  \sim \cN(0, I_{K})\,, \forall i \in [S] \,,
\end{align*}
for some number of Monte Carlo samples $S\geq 1$ (not to be confused with the logged data, also denoted $S$). Similarly, the gradients are approximated as 
\begin{align*}
  \nabla_{\mu, \sigma}\E{\theta \sim \mathcal{N}\left(\mu, \sigma^2 I_{dK}\right)}{f(\pi^{\textsc{sof}}_{\theta}(a | x))}
  &\approx  \frac{1}{S} \sum_{i \in [S]}\nabla_{\mu, \sigma}f\Big(\frac{\exp(\phi(x)^\top \mu_a + \sigma \epsilon_{i, a})}{\sum_{a^\prime \in \cA}\exp(\phi(x)^\top  \mu_{a^\prime} + \sigma \epsilon_{i, a^\prime})}\Big)\,, &\epsilon_i  \sim \cN(0, I_{K})\,, \forall i \in [S] \,.
\end{align*}
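
The second equality of the reparameterization above can be spelled out as follows; this is a standard Gaussian computation, included here for completeness. Under $\theta \sim \cN(\mu, \sigma^2 I_{dK})$, the blocks $\theta_a \in \real^d$ are independent with $\theta_a \sim \cN(\mu_a, \sigma^2 I_d)$, so for every $a \in \cA$,
\begin{align*}
\phi(x)^\top \theta_a \sim \cN\big(\phi(x)^\top \mu_a, \, \sigma^2 \|\phi(x)\|_2^2\big) = \cN\big(\phi(x)^\top \mu_a, \, \sigma^2\big)\,,
\end{align*}
and these scores are independent across actions. They therefore have the same joint distribution as $(\phi(x)^\top \mu_a + \sigma \epsilon_a)_{a \in \cA}$ with $\epsilon \sim \cN(0, I_K)$. In particular, each Monte Carlo sample requires only $K$ Gaussian draws per context instead of $dK$.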

\subsection{Bound Optimization When IW Regularization is Linear}\label{app:linear_reg}

In this section, the learned policy $\hat{\pi}_n$ is obtained by optimizing the following objective, derived from the bound in \cref{corr:lin_reg_main}:
\begin{align}
   \operatorname*{arg\,min}_{\mathbb{Q}} \hat{R}(\pi_{\mathbb{Q}}, S) + \sqrt{ \frac{{\textsc{kl}}_{1}(\mathbb{Q})}{2n} } + B_n(\pi_{\mathbb{Q}})  +
\frac{{\textsc{kl}}_{2}(\mathbb{Q})}{n \lambda } + \frac{\lambda}{2}\bar{V}_n(\pi_{\mathbb{Q}})\,,
\end{align}
where ${\textsc{kl}}_{1}(\mathbb{Q})=D_{\mathrm{KL}}(\mathbb{Q} \| \mathbb{P})+\log \frac{4\sqrt{n}}{\delta}$, ${\textsc{kl}}_{2}(\mathbb{Q})=D_{\mathrm{KL}}(\mathbb{Q} \| \mathbb{P})+\log (4 / \delta)$, and
\begin{align*}
    &\bar{V}_n(\pi_{\mathbb{Q}}) = \frac{1}{n}\sum_{i=1}^n  \E{a \sim \pi_0(\cdot | x_i)}{\frac{\pi_{\mathbb{Q}}(a |x_i)}{h(\pi_{0}(a |x_i))^2}}+\frac{\pi_{\mathbb{Q}}(a_i |x_i)}{h(\pi_{0}(a_i |x_i))^2} c_i^2\\
   &B_n(\pi_{\mathbb{Q}}) = 1 - \frac{1}{n} \sum_{i=1}^{n} \sum_{a \in \cA}\pi_0(a | x_i)\frac{\pi_{\mathbb{Q}}(a |x_i)}{h(\pi_{0}(a |x_i))}\,.
\end{align*}
We optimize this objective over learning policies $\pi_\mathbb{Q}$ that are defined as Gaussian policies of the form
\begin{align}\label{eq:gaussian_pac_bayes}
   &\pi^{\textsc{gaus}}_{\mu, \sigma}(a | x) = \E{\theta \sim \cN(\mu, \sigma^2 I_{dK})}{\pi_\theta(a|x)}\,, & \text{where } \, \pi_\theta(a|x) = \mathds{1}_{\{ \argmax_{a^\prime \in \cA} \phi(x)^\top \theta_{a^\prime} = a \}}\,.
\end{align}
 Note that these Gaussian policies have the form $\pi_{\mathbb Q}(a | x) = \E{\theta \sim \mathbb Q}{\pi_\theta(a|x)}$ where $\pi_\theta$ is binary, which is required by \cref{corr:lin_reg_main}. The above objective contains no outer expectation over $\theta \sim \mathbb{Q}$, and hence the method described in \cref{app:bound_opt} is no longer needed. All we need is to compute the propensities $\pi_\mathbb{Q}(a|x)$ and the gradients of the above objective with respect to $\mathbb{Q}$, which boils down to computing the gradient of $\pi_\mathbb{Q}$ with respect to $\mathbb{Q}$ since the objective above is linear in $\pi_\mathbb{Q}$. Propensities and gradients are computed as follows. First, \citet{sakhi2022pac} showed that \eqref{eq:gaussian_pac_bayes} can be written as 
\begin{align*}
 \pi^{\textsc{gaus}}_{\mu, \sigma}(a | x)  & =\mathbb{E}_{\epsilon \sim \cN(0, 1)}\Big[\prod_{a^{\prime} \neq a} \Phi\big(\epsilon+\frac{\phi(x)^\top\left(\mu_a-\mu_{a^{\prime}}\right)}{\sigma\|\phi(x)\|}\big)\Big]\,,
\end{align*}
where $\Phi$ is the cumulative distribution function of a standard normal variable. Then, the propensities are approximated as 
\begin{align}
      &\pi^{\textsc{gaus}}_{\mu, \sigma}(a | x)  \approx \frac{1}{S} \sum_{i \in [S]}{\prod_{a^{\prime} \neq a} \Phi\big(\epsilon_i+\frac{\phi(x)^\top\left(\mu_a-\mu_{a^{\prime}}\right)}{\sigma \|\phi(x)\|}\big)}\,, & 
\epsilon_i  \sim \cN(0, 1)\,, \forall i \in [S]\,.
\end{align}
Similarly, the gradient reads 
\begin{align*}
    \nabla_{\mu, \sigma} \pi^{\textsc{gaus}}_{\mu, \sigma}(a | x) =  \mathbb{E}_{\epsilon \sim \cN(0, 1)}\Big[\nabla_{\mu, \sigma}\prod_{a^{\prime} \neq a} \Phi\big(\epsilon+\frac{\phi(x)^\top\left(\mu_a-\mu_{a^{\prime}}\right)}{\sigma \|\phi(x)\|}\big)\Big]\,,
\end{align*}
which can be approximated as 
\begin{align*}
    \nabla_{\mu, \sigma} \pi^{\textsc{gaus}}_{\mu, \sigma}(a | x) \approx  \frac{1}{S} \sum_{i \in [S]} \nabla_{\mu, \sigma}\prod_{a^{\prime} \neq a} \Phi\big(\epsilon_i+\frac{\phi(x)^\top\left(\mu_a-\mu_{a^{\prime}}\right)}{\sigma \|\phi(x)\|}\big)\,, \qquad \epsilon_i  \sim \cN(0, 1)\,, \forall i \in [S]\,.
\end{align*}
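
To make the structure of these gradients explicit, one can apply the product rule by hand; this is only a sketch, and in practice automatic differentiation would typically be used instead. Writing $z_{a^\prime} = \frac{\phi(x)^\top(\mu_a - \mu_{a^\prime})}{\sigma \|\phi(x)\|}$ (a shorthand introduced here) and $\Phi^\prime$ for the standard normal density, we have, for a fixed $\epsilon$,
\begin{align*}
\nabla_{\mu_a} \prod_{a^\prime \neq a} \Phi(\epsilon + z_{a^\prime}) &= \frac{\phi(x)}{\sigma \|\phi(x)\|} \sum_{a^\prime \neq a} \Phi^\prime(\epsilon + z_{a^\prime}) \prod_{a^{\prime\prime} \notin \{a, a^\prime\}} \Phi(\epsilon + z_{a^{\prime\prime}})\,,\\
\frac{\partial}{\partial \sigma} \prod_{a^\prime \neq a} \Phi(\epsilon + z_{a^\prime}) &= - \frac{1}{\sigma} \sum_{a^\prime \neq a} z_{a^\prime} \, \Phi^\prime(\epsilon + z_{a^\prime}) \prod_{a^{\prime\prime} \notin \{a, a^\prime\}} \Phi(\epsilon + z_{a^{\prime\prime}})\,.
\end{align*}
The gradient with respect to $\mu_{b}$ for $b \neq a$ contains only the $a^\prime = b$ term of the first sum, with the opposite sign.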








The results are presented in \cref{fig:main_exp_results_app_lin} and are generally consistent with the conclusions drawn in \cref{sec:experiments}. The main distinction is that when utilizing linear IW regularizations and the approach detailed in \cref{corr:lin_reg_main}, all methods exhibit better performance compared to when they are optimized using the theoretical bound in \cref{thm:main_result}, which is applicable to potentially non-linear IW regularizations. This improvement is attributed to the reduction of variance achieved by removing the expectation \(\E{\theta \sim \mathbb Q}{\cdot}\) from the bound and employing Gaussian policies.

\begin{figure}[H]
  \centering  \includegraphics[width=\linewidth]{figures/results_linear.pdf}
  \vspace{-0.8cm}
  \caption{Performance of the policy learned by optimizing the bound in \cref{corr:lin_reg_main} for different IW regularizations. The \(x\)-axis reflects the quality of the logging policy \(\eta_0 \in [-0.5, 0.5]\). In the first three columns, we plot the reward of the learned policy using a fixed IW regularization technique (\texttt{Clip}, \texttt{IX}, or \texttt{ES} as defined in \eqref{eq:regs}) for various values of its hyperparameter within \([0,1]\). In the last column, we report the mean reward across these hyperparameter values.} 
  \label{fig:main_exp_results_app_lin}
\end{figure}

\subsection{Comparing heuristics}\label{app:heuristic_comp}
We also compared our \textbf{Heuristic Optimization} \eqref{eq:learning_principle} with the \(L_2\) heuristic from \citet{london2019bayesian} and found that both heuristics exhibit identical performance (the red and blue colors overlap in \cref{fig:heuristic_comparison}).

\begin{figure}[H]
  \centering  \includegraphics[width=0.6\linewidth]{figures/comparison_heuristic_optimization_sota.pdf}
  \vspace{-0.4cm}
  \caption{Performance of the learned policy with two learning principles (our \textbf{Heuristic Optimization} \eqref{eq:learning_principle} and the $L_2$ heuristic in \citet{london2019bayesian} with varying values of their hyperparameters in a grid within $[10^{-5}, 10^{-3}]$) using the \texttt{Clip} IPS risk estimator in \eqref{eq:regs} with fixed  \(\tau=1/\sqrt[4]{n}\).} 
  \label{fig:heuristic_comparison}
\end{figure}



\subsection{Tightness of the bound}\label{app:bound_tightness}

We assess the tightness of our bound for a fixed IW regularization on the \texttt{MNIST} dataset. Specifically, we consider the \texttt{Clip} method as defined in \eqref{eq:regs}, which regularizes the IW as \(\hat{w}(x, a) = \frac{\pi(a|x)}{\max(\pi_0(a|x), \tau)}\). We apply \cref{corr:lin_reg_main} to this estimator by setting \(h(p) = \max(p, \tau)\) and evaluate the bound at the learned policy for different values of \(\tau\). The results are plotted in \cref{fig:bound_tightness}. Generally, the bound is loose when the logging policy performs poorly, i.e., when \(\eta_0 < 0.2\), and it tightens as the performance of the logging policy improves, i.e., as \(\eta_0\) increases. The value of \(\tau\) affects the tightness of the bound, but its impact is limited: no choice of \(\tau\) leads to a consistently loose bound across logging policies.
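
For the \texttt{Clip} choice $h(p) = \max(p, \tau)$, the term $B_n$ appearing in the objective of \cref{app:linear_reg} takes a simple form. The following is a direct computation from its definition, using $\sum_{a \in \cA} \pi_{\mathbb{Q}}(a | x_i) = 1$:
\begin{align*}
B_n(\pi_{\mathbb{Q}}) &= \frac{1}{n}\sum_{i=1}^n \sum_{a \in \cA} \pi_{\mathbb{Q}}(a | x_i) \Big(1 - \frac{\pi_0(a | x_i)}{\max(\pi_0(a | x_i), \tau)}\Big)\\
&= \frac{1}{n}\sum_{i=1}^n \sum_{a \,:\, \pi_0(a | x_i) < \tau} \pi_{\mathbb{Q}}(a | x_i) \Big(1 - \frac{\pi_0(a | x_i)}{\tau}\Big)\,.
\end{align*}
In particular, $B_n(\pi_{\mathbb{Q}}) = 0$ when $\tau = 0$ (no clipping), and $B_n(\pi_{\mathbb{Q}})$ is driven entirely by the mass that $\pi_{\mathbb{Q}}$ places on actions whose logging propensity falls below $\tau$.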

\begin{figure}[H]
  \centering  \includegraphics[width=0.4\linewidth]{figures/bound_plot.pdf}
  \vspace{-0.4cm}
  \caption{Tightness of the bound in \cref{corr:lin_reg_main} applied to \texttt{Clip}-IPS in \eqref{eq:regs} with varying values of hyperparameter $\tau$.} 
  \label{fig:bound_tightness}
\end{figure}



