\section{Theoretical Analysis}\label{sec:main_result} 
We derive our PAC-Bayes generalization bound for the regularized IPS estimator \(\hat{R}(\pi)\) in \eqref{eq:reg_ips_policy_value} under the assumption that \(\hat{w}(x, a) = g(\pi(a | x), \pi_0(a | x))\) for any \((x, a) \in \cX \times \cA\), where \(g: [0, 1] \times [0, 1] \to \real^+\). This assumption is broadly applicable and aligns with known IW regularizations. We make it to explicitly clarify the dependence on \(\pi(a|x)\) and purposefully exclude self-normalized IW, where \(\hat{w}(x_i, a_i) = n w(x_i, a_i)/\sum_{j \in [n]} w(x_j, a_j)\). In self-normalized IW, the regularization depends not only on the specific pair \(x_i, a_i\) but also on all other pairs \(x_j, a_j\), which is not supported by our theory.

\subsection{Introduction to PAC-Bayes theory}\label{subsec:pac_bayes_framekwork}

Consider learning problems specified by an instance space denoted as \(\mathcal{Z}\), a hypothesis space \(\mathcal{H}\) consisting of predictors \(h\), and a loss function \(L : \mathcal{H} \times \mathcal{Z} \rightarrow \real\). Assume access to a dataset \(S = (z_i)_{i \in [n]}\), where $z_1,\dots, z_n$ are i.i.d. from an unknown distribution \(\mathbb{D}\). The risk of a hypothesis \(h\) is defined as \(R(h) = \mathbb{E}_{z \sim \mathbb{D}}[L(h, z)]\), while its empirical counterpart is denoted as \(\hat{R}(h, S) = \frac{1}{n} \sum_{i=1}^n L(h, z_i)\).

In PAC-Bayes, our primary focus is to examine the average generalization capabilities under a distribution \(\mathbb{Q}\) on \(\mathcal{H}\) by controlling the difference between the expected risk under \(\mathbb{Q}\) (expressed as \(\mathbb{E}_{h\sim \mathbb{Q}}[R(h)]\)) and the expected empirical risk under \(\mathbb{Q}\) (expressed as \(\mathbb{E}_{h\sim \mathbb{Q}}[\hat{R}(h, S)]\)).

An example of a PAC-Bayes generalization bound, originally proposed by \citet{mcallester1998some}, is as follows. Assume that \(L(h, z) \in [0,1]\) for any \((h, z) \in \mathcal{H} \times \mathcal{Z}\), and let \(\mathbb{P}\) be a fixed prior distribution on \(\mathcal{H}\) and \(\delta \in (0, 1)\). Then, with probability at least \(1-\delta\) over the sample \(S \sim \mathbb{D}^n\), it holds simultaneously for any distribution \(\mathbb{Q}\) on \(\mathcal{H}\) that
\begin{talign*}
 \mathbb{E}_{h\sim \mathbb{Q}}[R(h)] \leq \mathbb{E}_{h\sim \mathbb{Q}}[\hat{R}(h, S)]
  + \sqrt{ \frac{D_{\mathrm{KL}}(\mathbb{Q} \| \mathbb{P})+\log \frac{2\sqrt{n}}{\delta}}{2n}}\,,
\end{talign*}
where $D_{\mathrm{KL}}$ denotes the Kullback-Leibler divergence. The reader may refer to \citet{alquier2021user} for a comprehensive introduction to PAC-Bayes theory.
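
As a rough numerical illustration of the magnitudes involved (the values below are hypothetical and chosen only for this example), take \(n = 10^4\), \(\delta = 0.05\), and \(D_{\mathrm{KL}}(\mathbb{Q} \| \mathbb{P}) = 10\). The complexity term then evaluates to
\begin{align*}
  \sqrt{ \frac{10+\log \frac{2\sqrt{10^4}}{0.05}}{2 \cdot 10^4}}
  = \sqrt{ \frac{10+\log 4000}{20000}}
  \approx \sqrt{ \frac{18.29}{20000}}
  \approx 0.030\,,
\end{align*}
so, in this hypothetical setting, the expected risk under \(\mathbb{Q}\) exceeds the expected empirical risk by at most about \(0.03\) with probability at least \(0.95\).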


\subsection{Generalization Bounds for OPL}\label{subsec:gbound}
Let \(d'\) be a positive integer, and let \(\Theta \subset \mathbb{R}^{d'}\) be a \(d'\)-dimensional parameter space. We parametrize our learning policies as \(\pi_\theta\), defining our space of policies as \(\Pi = \{\pi_\theta; \theta \in \Theta\}\). An example of this is the softmax policy, parameterized as follows
\begin{align}\label{eq:softmax_pac_bayes}
    \pi^{\textsc{sof}}_{\theta}(a | x) &= \frac{\exp(\phi(x)^\top \theta_a)}{\sum_{a^\prime \in \cA}\exp(\phi(x)^\top  \theta_{a^\prime})}\,,
\end{align}
where \(\theta_a \in \mathbb{R}^d\) and consequently \(\theta = (\theta_a)_{a\in \cA} \in \mathbb{R}^{dK}\), with \(d' = dK\). Moreover, let \(\mathbb{Q}\) be a distribution on the parameter space \(\Theta\). Then PAC-Bayes theory allows us to control the quantity \(\left|\mathbb{E}_{\theta \sim \mathbb{Q}}[R(\pi_\theta) - \hat{R}(\pi_\theta, S)]\right|\), where
\begin{align*}
    R(\pi_\theta) &= \mathbb{E}_{x \sim \nu, a \sim \pi_{\theta}(\cdot | x)}[c(x, a)]\,,\\
    \hat{R}(\pi_\theta, S) &= \frac{1}{n} \sum_{i=1}^n \hat{w}_\theta(x_i, a_i)c_i\,,
\end{align*}
with $\hat{w}_{\theta}(x, a) =g( \pi_{\theta}(a | x),\pi_0(a | x))$. We also assume that the costs are deterministic for ease of exposition. The same result holds for stochastic costs. The proof is provided in \cref{proofs:opl}.

\begin{theorem}\label{thm:main_result}
Let \(\lambda > 0\), \(n \ge 1\), \(\delta \in (0, 1)\), and let \(\mathbb{P}\) be a fixed prior on \(\Theta\). The following inequality holds with probability at least \(1 - \delta\) for any distribution \(\mathbb{Q}\) on \(\Theta\):
\begin{align}\label{eq:app_main_inequality_maint}
    &\left|\mathbb{E}_{\theta \sim \mathbb{Q}}[R(\pi_\theta) - \hat{R}(\pi_\theta, S)]\right| \\
    &\qquad \qquad \leq \sqrt{ \frac{{\textsc{kl}}_1(\mathbb{Q})}{2n} }  + \frac{{\textsc{kl}}_2(\mathbb{Q})}{n \lambda } + B_n(\mathbb{Q})  + \frac{\lambda}{2}\bar{V}_n(\mathbb{Q})\,,\nonumber
\end{align}
where \({\textsc{kl}}_1(\mathbb{Q}) = D_{\mathrm{KL}}(\mathbb{Q} \| \mathbb{P}) + \log \frac{4\sqrt{n}}{\delta}\), \({\textsc{kl}}_2(\mathbb{Q}) = D_{\mathrm{KL}}(\mathbb{Q} \| \mathbb{P}) + \log \frac{4}{\delta}\), and
\begin{align*}
    \bar{V}_n(\mathbb{Q}) = \frac{1}{n}\sum_{i=1}^n \mathbb{E}_{\theta \sim \mathbb{Q}}&\big[\mathbb{E}_{a \sim \pi_0(\cdot | x_i)}[\hat{w}_\theta(x_i, a)^2] \\ &+ \hat{w}_\theta(x_i, a_i)^2 c_i^2\big]\,,\\
  \text{and } \,  B_n(\mathbb{Q}) = \frac{1}{n} \sum_{i=1}^{n} \sum_{a \in \cA} \mathbb{E}_{\theta \sim \mathbb{Q}}&\big[|\pi_{\theta}(a | x_i) \\ & - \pi_0(a | x_i) \hat{w}_\theta(x_i, a)|\big]\,.
\end{align*}
\end{theorem}
%This bound is tractable when the KL terms
%\(D_{\mathrm{KL}}\)can be computed, 
Generally, the bound is tractable due to the conditioning on the contexts \((x_i)_{i \in [n]}\), allowing us to bypass the need for computing the unknown expectation \(\mathbb{E}_{x \sim \nu}[\cdot]\). Recall that the prior \(\mathbb{P}\) is any fixed distribution over \(\Theta\). In particular, if a \(\theta_0\) exists such that the logging policy \(\pi_0 = \pi_{\theta_0}\), then \(\mathbb{P}\) can be specified as Gaussian with mean \(\theta_0\) and some covariance. The terms \(\textsc{kl}_1(\mathbb{Q})\) and \(\textsc{kl}_2(\mathbb{Q})\) contain the divergence \(D_{\mathrm{KL}}(\mathbb{Q} \| \mathbb{P})\), which penalizes posteriors \(\mathbb{Q}\) that deviate significantly from the prior \(\mathbb{P}\). This divergence can be computed in closed form when, for instance, both $\mathbb{P}$ and $\mathbb{Q}$ are Gaussian.
Moreover, \(B_n(\mathbb{Q})\) represents the bias introduced by the IW regularization, given contexts \((x_i)_{i \in [n]}\); \(B_n(\mathbb{Q}) = 0\) when \(\hat{w}_\theta(x, a) = w(x, a)\) (no IW regularization) and \(B_n(\mathbb{Q}) > 0\) otherwise. The first term in \(\bar{V}_n(\mathbb{Q})\) resembles the theoretical second moment of the regularized IWs \(\hat{w}_\theta(x, a)\) (without the cost) when viewed as random variables, while the second term resembles the empirical second moment of \(\hat{w}_\theta(x, a) c\) (with the cost). If \(\bar{V}_n(\mathbb{Q})\) is bounded (which is the case for all IW regularizations in \cref{sec:regularizations} except \texttt{ES}), we can set \(\lambda = 1/\sqrt{n}\), resulting in a \(\mathcal{O}(1/\sqrt{n} + B_n(\mathbb{Q}))\) bound.
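
To make the last claim explicit, substituting \(\lambda = 1/\sqrt{n}\) into the two \(\lambda\)-dependent terms of \eqref{eq:app_main_inequality_maint} gives
\begin{align*}
    \frac{{\textsc{kl}}_2(\mathbb{Q})}{n \lambda } + \frac{\lambda}{2}\bar{V}_n(\mathbb{Q})
    = \frac{1}{\sqrt{n}}\Big({\textsc{kl}}_2(\mathbb{Q}) + \frac{1}{2}\bar{V}_n(\mathbb{Q})\Big)\,.
\end{align*}
Together with the first term \(\sqrt{{\textsc{kl}}_1(\mathbb{Q})/(2n)}\), this is of order \(1/\sqrt{n}\) for a fixed \(\mathbb{Q}\) with bounded \(\bar{V}_n(\mathbb{Q})\), up to the logarithmic factor in \(n\) contained in \({\textsc{kl}}_1(\mathbb{Q})\).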

\textbf{Linear vs. Non-linear IW Regularization.} If \(\hat{w}_\theta(x, a)\) is linear in \(\pi_{\theta}(a | x)\) (i.e., $g$ is linear in its first variable), then \(\hat{R}\) is also linear in \(\pi_\theta\), yielding
\begin{align*}
    \left|\mathbb{E}_{\theta \sim \mathbb{Q}}[R(\pi_{\theta}) - \hat{R}(\pi_{\theta}, S)]\right| = \left|R(\pi_{\mathbb{Q}}) - \hat{R}(\pi_{\mathbb{Q}}, S)\right|\,,
\end{align*}
where we define
\begin{align}\label{eq:pac_bayes_policy}
    \pi_{\mathbb{Q}} = \mathbb{E}_{\theta \sim \mathbb{Q}}[\pi_{\theta}]\,.
\end{align}
This technique is widely used in the literature \citep{london2019bayesian, sakhi2022pac, aouali23a} because it allows translating the bound in \cref{thm:main_result}, which controls \(\left|\mathbb{E}_{\theta \sim \mathbb{Q}}[R(\pi_{\theta}) - \hat{R}(\pi_{\theta}, S)]\right|\), into a bound that controls \(|R(\pi_{\mathbb{Q}}) - \hat{R}(\pi_{\mathbb{Q}}, S)|\), the quantity of interest in OPL. The main requirement is to find linear IW regularizations and policies that satisfy \eqref{eq:pac_bayes_policy}. Fortunately, many IW regularizations, such as \texttt{Clip}, \texttt{IX}, and \texttt{ES} in \eqref{eq:regs}, are linear in \(\pi\), and several practical policies adhere to the formulation in \eqref{eq:pac_bayes_policy}; refer to \citet[Section 4.2]{aouali23a} for an in-depth explanation of such policies, including softmax, mixed-logit, and Gaussian policies. In fact, \citet{sakhi2022pac} demonstrated that any policy can be written as \eqref{eq:pac_bayes_policy}.
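
As a brief sketch of why linearity suffices, consider a regularization of the form \(\hat{w}_\theta(x, a) = \pi_\theta(a|x)/h(\pi_0(a|x))\), where \(h\) is a positive function (an illustrative choice made here for concreteness). Exchanging \(\mathbb{E}_{\theta \sim \mathbb{Q}}\) with the sum over actions and, since the costs are bounded, with the expectation over contexts, we obtain
\begin{align*}
    \mathbb{E}_{\theta \sim \mathbb{Q}}[R(\pi_\theta)]
    &= \mathbb{E}_{x \sim \nu}\Big[\sum_{a \in \cA} \mathbb{E}_{\theta \sim \mathbb{Q}}[\pi_\theta(a | x)]\, c(x, a)\Big]
    = R(\pi_{\mathbb{Q}})\,,\\
    \mathbb{E}_{\theta \sim \mathbb{Q}}[\hat{R}(\pi_\theta, S)]
    &= \frac{1}{n}\sum_{i=1}^n \frac{\mathbb{E}_{\theta \sim \mathbb{Q}}[\pi_\theta(a_i | x_i)]}{h(\pi_0(a_i | x_i))}\, c_i
    = \hat{R}(\pi_{\mathbb{Q}}, S)\,.
\end{align*}
Subtracting the two lines gives the identity above. The second line is where linearity of \(\hat{w}_\theta\) in \(\pi_\theta\) is needed. For a non-linear \(g\), the expectation over \(\theta\) cannot in general be moved inside \(g\).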

In \cref{corr:lin_reg_main}, we specialize \cref{thm:main_result} under linear IW regularizations of the form \(\hat{w}_\theta(x, a) = \frac{\pi_\theta(a|x)}{h(\pi_0(a|x))}\), assuming \(h(\pi_0(a|x)) \geq \pi_0(a|x)\) for any \((x, a) \in \cX \times \cA\). Additionally, we assume that \(\pi_\theta\) is binary, meaning \(\pi_\theta(a \mid x) \in \{0, 1\}\) for any \((x, a) \in \cX \times \cA\). In other words, \(\pi_\theta\) is deterministic, allowing us to use \(\pi_\theta(a | x)^2 = \pi_\theta(a | x)\) for any \((x, a) \in \cX \times \cA\). Essentially, the policies \(\pi_{\mathbb{Q}}\) defined in \eqref{eq:pac_bayes_policy} can be viewed as mixtures of deterministic policies under \(\mathbb{Q}\). Note that the assumption that \(\pi_\theta\) is binary is mild. For instance, policies $\pi_{\mathbb{Q}}$ that can be written as mixtures of binary $\pi_{\theta}$ include softmax, mixed-logit, and Gaussian policies \citep[Section 4.2]{aouali23a}. Under these assumptions, \cref{thm:main_result} yields the following result.


\begin{corollary}\label{corr:lin_reg_main} Assume the regularized IWs can be written as \(\hat{w}_\theta(x, a) = \frac{\pi_\theta(a|x)}{h(\pi_0(a|x))}\), where $h:[0,1]\to \mathbb{R}^+$ satisfies $h(p) \geq p$ for any $p \in [0, 1]$. Moreover, for any distribution $\mathbb Q$ on the parameter space $\Theta$, define $\pi_{\mathbb{Q}} = \mathbb{E}_{\theta \sim \mathbb{Q}}[\pi_{\theta}]$, where $\pi_{\theta}$ is binary. Let $\lambda>0$,  $n \ge 1$, $\delta \in (0, 1)$, and let $\mathbb{P}$ be a fixed prior on $\Theta$.
The following inequality holds with probability at least $1-\delta$ for any distribution $\mathbb{Q}$ on $\Theta$:
\begin{align}
    \Big|R(&\pi_{\mathbb{Q}})-\hat{R}(\pi_{\mathbb{Q}}, S)\Big|  \\  & \le \sqrt{ \frac{{\textsc{kl}}_{1}(\mathbb{Q})}{2n} } + B_n(\pi_{\mathbb{Q}})  +
\frac{{\textsc{kl}}_{2}(\mathbb{Q})}{n \lambda } + \frac{\lambda}{2}\bar{V}_n(\pi_{\mathbb{Q}})\,,\nonumber
\end{align}
where ${\textsc{kl}}_{1}(\mathbb{Q})$ and  ${\textsc{kl}}_{2}(\mathbb{Q})$ are defined in \cref{thm:main_result}, and
\begin{align*}
    &\bar{V}_n(\pi_{\mathbb{Q}}) = \frac{1}{n}\sum_{i=1}^n  \E{a \sim \pi_0(\cdot | x_i)}{\frac{\pi_{\mathbb{Q}}(a |x_i)}{h(\pi_{0}(a |x_i))^2}} \\
    &\hspace{1.9in}+\frac{\pi_{\mathbb{Q}}(a_i |x_i)}{h(\pi_{0}(a_i |x_i))^2} c_i^2\,,\\
   &B_n(\pi_{\mathbb{Q}}) = 1 - \frac{1}{n} \sum_{i=1}^{n} \sum_{a \in \cA}\pi_0(a | x_i)\frac{\pi_{\mathbb{Q}}(a |x_i)}{h(\pi_{0}(a |x_i))}\,.
\end{align*}
\end{corollary}
The terms in the above bound have similar interpretations to those in \cref{thm:main_result}. The main benefit of \cref{corr:lin_reg_main} is that it eliminates the need for the expectation \(\E{\theta \sim \mathbb{Q}}{\cdot}\), which is now embedded in the definition of policies in \eqref{eq:pac_bayes_policy}. For example, \cref{corr:lin_reg_main} allows us to recover the main result of \texttt{ES} in \citet{aouali23a} when \(h(p) = p^\alpha\), \(\alpha \in [0, 1]\). Similarly, we can apply it to \texttt{IX} \citep{gabbianelli2023importance} by setting \(h(p) = p + \gamma\), \(\gamma \geq 0\), and to \texttt{Clip} \citep{london2019bayesian} by setting \(h(p) = \max(p, \tau)\), \(\tau \in [0, 1]\).
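
For completeness, here is a short sketch of how the two terms of \cref{corr:lin_reg_main} follow from those of \cref{thm:main_result}. Since \(\pi_\theta\) is binary, \(\pi_\theta(a|x)^2 = \pi_\theta(a|x)\), so that
\begin{align*}
    \mathbb{E}_{\theta \sim \mathbb{Q}}\big[\hat{w}_\theta(x, a)^2\big]
    = \frac{\mathbb{E}_{\theta \sim \mathbb{Q}}[\pi_\theta(a | x)]}{h(\pi_0(a | x))^2}
    = \frac{\pi_{\mathbb{Q}}(a | x)}{h(\pi_0(a | x))^2}\,,
\end{align*}
which gives both summands of \(\bar{V}_n(\pi_{\mathbb{Q}})\). For the bias term, \(h(p) \geq p\) implies \(\pi_0(a|x)\hat{w}_\theta(x, a) \leq \pi_\theta(a|x)\), so the absolute value can be dropped:
\begin{align*}
    \sum_{a \in \cA} \mathbb{E}_{\theta \sim \mathbb{Q}}\big[|\pi_{\theta}(a | x_i) - \pi_0(a | x_i) \hat{w}_\theta(x_i, a)|\big]
    &= \sum_{a \in \cA} \pi_{\mathbb{Q}}(a | x_i)\Big(1 - \frac{\pi_0(a | x_i)}{h(\pi_0(a | x_i))}\Big)\\
    &= 1 - \sum_{a \in \cA} \pi_0(a | x_i)\frac{\pi_{\mathbb{Q}}(a | x_i)}{h(\pi_0(a | x_i))}\,,
\end{align*}
where the last step uses \(\sum_{a \in \cA} \pi_{\mathbb{Q}}(a | x_i) = 1\). As an instance, for \texttt{IX} with \(h(p) = p + \gamma\), this yields
\begin{align*}
    B_n(\pi_{\mathbb{Q}}) = \frac{1}{n} \sum_{i=1}^{n} \sum_{a \in \cA} \frac{\gamma}{\pi_0(a | x_i) + \gamma}\,\pi_{\mathbb{Q}}(a | x_i)\,,
\end{align*}
which is nondecreasing in \(\gamma\) and vanishes at \(\gamma = 0\) whenever \(\pi_0(a | x_i) > 0\).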






Finally, if \(\hat{w}_\theta(x, a)\) is not linear in \(\pi_{\theta}\), then this technique cannot be used, and the original expectation \(\mathbb{E}_{\theta \sim \mathbb{Q}}[\cdot]\) in \cref{thm:main_result} must be retained.

\textbf{Limitations.} This bound has two main limitations. \textbf{1)} Despite its broad applicability, \cref{thm:main_result} cannot be applied straightforwardly to bound the suboptimality gap of the learned policy, that is, \(R(\hat{\pi}_n) - R(\pi_*)\), where \(\pi_* = \argmin_{\pi \in \Pi} R(\pi)\) is the optimal policy and \(\hat{\pi}_n\) is learned by optimizing the bound. To illustrate, consider the linear IW regularization case in \cref{corr:lin_reg_main} and suppose that \(\pi_*\) can be expressed as \(\pi_* = \pi_{\mathbb{Q}_*}\), where \(\mathbb{Q}_* = \argmin_{\mathbb{Q}} R(\pi_{\mathbb{Q}})\) is the optimal distribution. In this scenario, the suboptimality gap would be bounded by the upper bound in \cref{corr:lin_reg_main}, evaluated at the optimal distribution \(\mathbb{Q} = \mathbb{Q}_*\). However, the scaling of this suboptimality bound with \(n\) is not immediately evident for general IW regularizers and requires individual examination for each IW regularization. This is because the bound contains numerous empirical (data-dependent) terms such as \(B_n(\pi_{\mathbb{Q}_*})\) and \(\bar{V}_n(\pi_{\mathbb{Q}_*})\) that are not easily transformed into data-independent terms that scale as \(\mathcal{O}(1/\sqrt{n})\). Nonetheless, the versatility, tractability, and proven empirical benefits of our bound (\cref{sec:experiments}) make it appealing. \textbf{2)} Two-sided bounds for IW estimators that are derived directly might be loose because they treat both tails similarly, whereas prior work \citep{gabbianelli2023importance} indicates essential differences between the lower and upper tails, as seen in the \texttt{IX}-estimator \citep{gabbianelli2023importance}. This work directly derives two-sided bounds for general regularized IPS. Investigating whether bounding each side individually could lead to terms that are easier to interpret, and could resolve the first limitation, is an interesting direction for future research.

\subsection{Pessimistic Learning Principles}\label{subsec:lp}

\cref{thm:main_result} yields two pessimistic learning principles. 

\textbf{Bound Optimization.} First, one can directly learn a $\hat{\pi}_n$ that optimizes the bound of \Cref{thm:main_result} as follows
\begin{align}\label{eq:objective_pac_bayes}
   \argmin_{\mathbb{Q}} \E{\theta \sim \mathbb{Q}}{\hat{R}(\pi_{\theta}, S)} + \sqrt{ \frac{{\textsc{kl}}_{1}(\mathbb{Q})}{2n} } + B_n(\mathbb{Q}) \nonumber \\  +
\frac{{\textsc{kl}}_{2}(\mathbb{Q})}{n \lambda } + \frac{\lambda}{2}\bar{V}_n(\mathbb{Q})\,.
\end{align}
Here, the main challenge is that the objective involves an expectation under $\mathbb{Q}$. Fortunately, the reparameterization trick \citep{kingma2015variational} can be used in this case. This trick allows us to express a gradient of an expectation as an expectation of a gradient, which can then be estimated using the empirical mean (Monte Carlo approximation). In our case, we use the \emph{local} reparameterization trick \citep{kingma2015variational}, known for reducing the variance of stochastic gradients. Specifically, we consider softmax policies $\pi^{\textsc{sof}}_{\theta}(a | x)$ in \eqref{eq:softmax_pac_bayes} and set $\mathbb{Q} =  \mathcal{N}\left(\mu, \sigma^2 I_{dK}\right)$ where $\mu \in \mathbb{R}^{dK}$ and $\sigma > 0$ are learnable parameters. Then, all terms in \eqref{eq:objective_pac_bayes} are of the form $\E{\theta \sim \mathcal{N}\left(\mu, \sigma^2 I_{dK}\right)}{f(\pi^{\textsc{sof}}_{\theta}(a | x))}$ for some function $f$. These terms can be rewritten as
\begin{talign*}
  &\E{\theta \sim \mathcal{N}\left(\mu, \sigma^2 I_{dK}\right)}{f(\pi^{\textsc{sof}}_{\theta}(a | x))} \\
  &= \E{\epsilon \sim \mathcal{N}(0, \|\phi(x)\|_2^2 I_{K})}{f\left(\frac{\exp(\phi(x)^\top \mu_a + \sigma \epsilon_a)}{\sum_{a^\prime \in \cA}\exp(\phi(x)^\top  \mu_{a^\prime} + \sigma \epsilon_{a^\prime})}\right)}\,.
\end{talign*}
This expectation is approximated by generating i.i.d. samples \(\epsilon_i \sim \mathcal{N}(0, \|\phi(x)\|_2^2 I_{K})\) and computing the corresponding empirical mean. The gradients are approximated similarly (\cref{app:bound_opt}). Unfortunately, these techniques can induce high variance when the number of actions \(K\) is large. This can be mitigated by considering linear IW regularizations and optimizing the bound in \cref{corr:lin_reg_main}. However, this technique only works for linear IW regularizations. Therefore, we propose another practical learning principle inspired by our bound in \Cref{thm:main_result}, which enhances performance at the cost of additional hyperparameters.
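
For concreteness, the local reparameterization identity used above can be unpacked as follows. Under \(\mathbb{Q} = \mathcal{N}(\mu, \sigma^2 I_{dK})\), the blocks \(\theta_a \sim \mathcal{N}(\mu_a, \sigma^2 I_d)\) are independent across actions, so each logit is a one-dimensional Gaussian:
\begin{align*}
    \phi(x)^\top \theta_a \sim \mathcal{N}\big(\phi(x)^\top \mu_a,\, \sigma^2 \|\phi(x)\|_2^2\big)\,,
    \quad \text{i.e.,} \quad
    \phi(x)^\top \theta_a \overset{d}{=} \phi(x)^\top \mu_a + \sigma \epsilon_a\,, \ \ \epsilon_a \sim \mathcal{N}(0, \|\phi(x)\|_2^2)\,.
\end{align*}
Hence only \(K\)-dimensional noise needs to be sampled rather than \(dK\)-dimensional noise. Writing \(m\) for the number of Monte Carlo samples and \(\epsilon^{(1)}, \dots, \epsilon^{(m)}\) for the i.i.d. draws (notation introduced here for illustration only), the resulting approximation reads
\begin{align*}
    \frac{1}{m}\sum_{j=1}^m f\left(\frac{\exp(\phi(x)^\top \mu_a + \sigma \epsilon^{(j)}_a)}{\sum_{a^\prime \in \cA}\exp(\phi(x)^\top  \mu_{a^\prime} + \sigma \epsilon^{(j)}_{a^\prime})}\right)\,,
\end{align*}
which is differentiable in \((\mu, \sigma)\) whenever \(f\) is differentiable.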

\textbf{Heuristic Optimization.} The following heuristic avoids the obstacles of directly optimizing the bound, at the cost of introducing some hyperparameters, while still being inspired by \Cref{thm:main_result}. This approach minimizes the estimated risk $\hat{R}$, penalized by its associated bias and variance terms from \Cref{thm:main_result}, along with a term enforcing proximity to the logging policy $\pi_0 = \pi_{\theta_0}$. Specifically, we minimize over $\theta$ the objective
\begin{align}\label{eq:learning_principle}
\hspace{-0.2cm} \hat{R}(\pi_{\theta}, S) + \lambda_1 \|\theta - \theta_0\|^2  + \lambda_2  \tilde{V}_n(\pi_{\theta}) + \lambda_3 \tilde{B}_n(\pi_{\theta})\,,
\end{align}
where $\tilde{V}_n(\pi_{\theta})$ and $\tilde{B}_n(\pi_{\theta})$ are the terms inside the expectations in $\bar{V}_n(\mathbb{Q})$ and $B_n(\mathbb{Q})$, respectively, $\theta_0$ is the parameter of $\pi_0$, and $\lambda_1, \lambda_2, \lambda_3$ are tunable hyperparameters.
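
Spelled out from the definitions in \cref{thm:main_result}, one natural reading of these two penalty terms (keeping the average over the \(n\) samples and dropping only the expectation over \(\theta \sim \mathbb{Q}\)) is
\begin{align*}
    \tilde{V}_n(\pi_{\theta}) &= \frac{1}{n}\sum_{i=1}^n \Big(\mathbb{E}_{a \sim \pi_0(\cdot | x_i)}[\hat{w}_\theta(x_i, a)^2] + \hat{w}_\theta(x_i, a_i)^2 c_i^2\Big)\,,\\
    \tilde{B}_n(\pi_{\theta}) &= \frac{1}{n} \sum_{i=1}^{n} \sum_{a \in \cA} \big|\pi_{\theta}(a | x_i) - \pi_0(a | x_i) \hat{w}_\theta(x_i, a)\big|\,,
\end{align*}
so that \(\bar{V}_n(\mathbb{Q}) = \mathbb{E}_{\theta \sim \mathbb{Q}}[\tilde{V}_n(\pi_{\theta})]\) and \(B_n(\mathbb{Q}) = \mathbb{E}_{\theta \sim \mathbb{Q}}[\tilde{B}_n(\pi_{\theta})]\).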

Both learning principles in \eqref{eq:objective_pac_bayes} and \eqref{eq:learning_principle} are suitable for stochastic gradient descent. They are also generic, enabling the comparison of different IW regularization techniques given fixed, observed logged data $S$. In \Cref{sec:experiments}, we will empirically compare these two learning principles and evaluate the effect of different IW regularization techniques.


\subsection{Sketch of proof for the main result}
Our goal is to bound $\E{\theta \sim \mathbb{Q}}{R(\pi_{\theta})-\hat{R}(\pi_{\theta}, S)}$. To achieve this, we decompose it into three terms as follows
\begin{align*}
    \mathbb{E}_{\theta \sim \mathbb{Q}}[R(\pi_{\theta})-\hat{R}(\pi_{\theta}, S)] = I_1 + I_2 + I_3\,.
\end{align*}
Next, we explain the terms $I_1$, $I_2$, and $I_3$ and the rationale for their introduction. 

First, $I_1 = \E{\theta \sim \mathbb{Q}}{R(\pi_{\theta}) - \frac{1}{n}\sum_{i=1}^n R(\pi_{\theta} | x_i)}$, where $R(\pi_{\theta} | x_i) = \E{a \sim \pi_{\theta}(\cdot | x_i)}{c(x_i, a)}$ is the risk given context $x_i$. This term captures the error of estimating the risk by its empirical mean over the $n$ i.i.d. contexts $(x_i)_{i \in [n]}$. It is introduced to avoid the intractable expectation over $x \sim \nu$, thereby leading to a tractable bound that can be directly used in our pessimistic learning principle.

Second, $I_2 = \frac{1}{n} \sum_{i=1}^n \E{\theta \sim \mathbb{Q}}{R(\pi_{\theta} | x_i) - \Rg(\pi_{\theta} | x_i)}$, where $\Rg(\pi_{\theta} | x_i) = \E{a \sim \pi_0(\cdot | x_i)}{\hat{w}_\theta(x_i, a)c(x_i, a)}$ is the expectation of the risk estimator given $x_i$. This term is a bias term conditioned on the contexts $(x_i)_{i \in [n]}$, and its absolute value can be bounded by tractable terms.

Finally, $I_3 = \frac{1}{n}\sum_{i=1}^n \E{\theta \sim \mathbb{Q}}{\Rg(\pi_{\theta} | x_i) - \hat{R}(\pi_{\theta}, S)}$ represents the estimation error of the risk conditioned on the contexts $(x_i)_{i \in [n]}$. This conditioning allows us to avoid the unknown expectation over $x \sim \nu$, making it possible to bound $|I_3|$ by tractable terms.
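
As a quick sanity check, the three terms indeed telescope. Using the definitions above,
\begin{align*}
    I_1 + I_2 + I_3
    &= \mathbb{E}_{\theta \sim \mathbb{Q}}[R(\pi_{\theta})] - \frac{1}{n}\sum_{i=1}^n \mathbb{E}_{\theta \sim \mathbb{Q}}[R(\pi_{\theta} | x_i)] \\
    &\quad + \frac{1}{n}\sum_{i=1}^n \mathbb{E}_{\theta \sim \mathbb{Q}}[R(\pi_{\theta} | x_i)] - \frac{1}{n}\sum_{i=1}^n \mathbb{E}_{\theta \sim \mathbb{Q}}[\Rg(\pi_{\theta} | x_i)] \\
    &\quad + \frac{1}{n}\sum_{i=1}^n \mathbb{E}_{\theta \sim \mathbb{Q}}[\Rg(\pi_{\theta} | x_i)] - \mathbb{E}_{\theta \sim \mathbb{Q}}[\hat{R}(\pi_{\theta}, S)] \\
    &= \mathbb{E}_{\theta \sim \mathbb{Q}}[R(\pi_{\theta}) - \hat{R}(\pi_{\theta}, S)]\,,
\end{align*}
since the intermediate sums cancel pairwise.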

These terms are bounded as follows.
\textbf{1)} \citet[Theorem~3.3]{alquier2021user} allows bounding $I_1$ with high probability as $|I_1| \leq \sqrt{\frac{{\textsc{kl}}_{1}(\mathbb{Q})}{2n}}$.
\textbf{2)} Using the fact that $|c(x, a)| \leq 1$ for any $(x, a) \in \cX \times \cA$, $|I_2|$ can be bounded as $|I_2| \leq B_n(\mathbb{Q})$.
\textbf{3)} Bounding $|I_3|$ is more challenging. We manage this by expressing the term using martingale difference sequences and adapting \citet[Theorem 2.1]{haddouche2022pac}. Let $(\mathcal{F}_i)_{i \in \{0\} \cup [n]}$ be a filtration adapted to $(S_i)_{i \in [n]}$ where $S_i = (a_\ell)_{\ell \in [i]}$ for any $i \in [n]$. Then define
\begin{align*}
f_i\left(a_i, \pi_{\theta}\right) = \E{a \sim \pi_0(\cdot | x_i)}{\hat{w}_\theta(x_i, a)c(x_i, a)}  \\-\hat{w}_\theta(x_i, a_i)c(x_i, a_i)\,.
\end{align*}
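We show that for any $\theta \in \Theta$, $(f_i(a_i, \pi_{\theta}))_{i \in [n]}$ forms a martingale difference sequence. Applying the adapted theorem to the associated martingale $M_n(\theta) = \sum_{i=1}^n f_i(a_i, \pi_{\theta})$ yields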
\begin{align*}
    \left|\E{\theta \sim \mathbb{Q}}{M_n(\theta)}\right| &\leq \frac{{\textsc{kl}}_2(\mathbb{Q})}{\lambda} + n\frac{\lambda}{2}\bar{V}_n(\mathbb{Q})\,,
\end{align*}
with high probability. Recognizing that $\E{\theta \sim \mathbb{Q}}{M_n(\theta)} = nI_3$, we derive the desired inequality
\begin{align*}
  \left|I_3\right| &\leq \frac{{\textsc{kl}}_2(\mathbb{Q})}{n \lambda} + \frac{\lambda}{2}\bar{V}_n(\mathbb{Q})\,.
\end{align*}
Our final bound is obtained by combining the previous inequalities on $|I_1|$, $|I_2|$, and $|I_3|$.
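
To give slightly more detail on two of the steps above (a sketch only; the complete argument is in \cref{proofs:opl}): for the bias term, expanding both conditional risks as sums over actions gives, for each $i \in [n]$,
\begin{align*}
    \big|R(\pi_{\theta} | x_i) - \Rg(\pi_{\theta} | x_i)\big|
    &= \Big|\sum_{a \in \cA} \big(\pi_{\theta}(a | x_i) - \pi_0(a | x_i)\hat{w}_\theta(x_i, a)\big) c(x_i, a)\Big| \\
    &\leq \sum_{a \in \cA} \big|\pi_{\theta}(a | x_i) - \pi_0(a | x_i)\hat{w}_\theta(x_i, a)\big|\,,
\end{align*}
where the inequality uses $|c(x, a)| \leq 1$. Averaging over $i \in [n]$ and over $\theta \sim \mathbb{Q}$ yields $|I_2| \leq B_n(\mathbb{Q})$. For the martingale property, assume (as is standard in this setting) that, given the contexts, each action $a_i$ is drawn from $\pi_0(\cdot | x_i)$ independently of the previous actions. Then
\begin{align*}
    \mathbb{E}\big[\hat{w}_\theta(x_i, a_i)c(x_i, a_i) \,\big|\, \mathcal{F}_{i-1}\big]
    = \mathbb{E}_{a \sim \pi_0(\cdot | x_i)}\big[\hat{w}_\theta(x_i, a)c(x_i, a)\big]\,,
\end{align*}
so that $\mathbb{E}[f_i(a_i, \pi_{\theta}) \,|\, \mathcal{F}_{i-1}] = 0$. Finally, on the event where the high-probability bounds on $|I_1|$ and $|I_3|$ both hold, the triangle inequality $|I_1 + I_2 + I_3| \leq |I_1| + |I_2| + |I_3|$ gives the right-hand side of \eqref{eq:app_main_inequality_maint}.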

