\section{Background}
\label{sec:setting}
\subsection{Offline Contextual Bandits}\label{subsec:setting}
An agent interacts with a \emph{contextual bandit} environment over $n$ rounds. In round $i \in [n]$, the agent observes a \emph{context} $x_i \sim \nu$, where $\nu$ is a distribution with support $\cX \subseteq \mathbb{R}^d$, a $d$-dimensional \emph{compact context space}. The agent then selects an \emph{action} $a_i$ from a \emph{finite action space} $\cA = [K]$. This action is sampled as $a_i \sim \pi_0(\cdot|x_i)$, where $\pi_0$ is the \emph{logging policy} used to collect data. Specifically, for a given context $x$, $\pi_0(a|x)$ is the probability that the agent takes action $a$ under the logging policy. Finally, the agent receives a stochastic cost\footnote{For simplicity, we assume that costs satisfy $c \in [-1, 0]$, though this can be easily extended to $c \in [-C, 0]$ for $C>0$.} $c_i \in [-1, 0]$ that depends on the observed context $x_i$ and the action $a_i$. Precisely, $c_i \sim p(\cdot | x_i, a_i)$, where $p(\cdot | x, a)$ is the \emph{cost distribution} of action $a$ in context $x$. The expected cost of action $a$ in context $x$ is given by the \emph{cost function} $c(x, a) = \E{c \sim p(\cdot | x, a)}{c}$. In alternative terminology, costs can be defined as the negative of rewards: for any $(x, a) \in \cX \times \cA$, $c(x, a) = -r(x, a)$, where $r : \cX \times \cA \rightarrow [0, 1]$ is the \emph{reward function}. These interactions result in logged data of size $n$, $S = ( x_i, a_i, c_i)_{i \in [n]}$, where the triples $(x_i, a_i, c_i)$ are i.i.d.\ from $\mu_{\pi_0}$. Here, for any policy $\pi$, $\mu_\pi$ denotes the joint distribution of $(x, a, c)$ defined as $\mu_\pi(x, a, c) = \nu(x)\pi(a | x)p(c | x, a)$ for any $(x, a, c) \in \cX \times \cA \times [-1, 0]$.

Agents are represented by stochastic policies $\pi \in \Pi$, where $\Pi$ denotes the space of policies. Specifically, for a given context $x \in \cX$, $\pi(\cdot | x)$ defines a probability distribution over the action space $\cA$. Then, the performance of a policy $\pi \in \Pi$ is measured by the \emph{risk}, defined as
\begin{align}\label{eq:policy_value}   R(\pi)=  \E{x \sim \nu, a \sim \pi(\cdot | x)}{c(x, a)}\,.\end{align}

Given a policy $\pi \in \Pi$ and logged data $S$, the goal of off-policy evaluation (OPE) is to design an estimator $\hat{R}(\pi, S)$ of the true risk $R(\pi)$ such that $\hat{R}(\pi, S) \approx R(\pi)$. Leveraging this estimator, off-policy learning (OPL) aims to find a policy $\hat{\pi}_n \in \Pi$ such that $R(\hat{\pi}_n) \approx \min_{\pi \in \Pi} R(\pi)$. We focus on the inverse propensity scoring (IPS) estimator \citep{horvitz1952generalization}, which estimates the risk $R(\pi)$ by re-weighting samples using the ratio between $\pi$ and $\pi_0$:
\begin{align}\label{eq:ips_policy_value}
    \hat{R}_{\textsc{ips}}(\pi, S) = \frac{1}{n} \sum_{i=1}^n w(x_i, a_i) c_i \,,
\end{align}
where for any $(x, a) \in \cX \times \cA$, \( w(x, a) = \pi(a | x)/\pi_0(a | x) \) are the \textit{importance weights (IWs)}.
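
As a brief sanity check on \eqref{eq:ips_policy_value}, the following sketch shows why re-weighting recovers the risk. It assumes that $\pi_0(a | x) > 0$ whenever $\pi(a | x) > 0$, so that the weights are well defined on every action that $\pi$ can play. For a single logged sample,
\begin{align*}
    \E{x \sim \nu, a \sim \pi_0(\cdot | x), c \sim p(\cdot | x, a)}{w(x, a)\, c}
    &= \E{x \sim \nu}{\sum_{a \in \cA \,:\, \pi_0(a | x) > 0} \pi_0(a | x)\, \frac{\pi(a | x)}{\pi_0(a | x)}\, c(x, a)} \\
    &= \E{x \sim \nu}{\sum_{a \in \cA} \pi(a | x)\, c(x, a)} = R(\pi)\,,
\end{align*}
where the second equality uses that, under the assumption above, any action with $\pi_0(a | x) = 0$ also has $\pi(a | x) = 0$ and thus contributes nothing to the sum. Since the samples in $S$ are i.i.d., averaging over $i \in [n]$ gives $\E{S}{\hat{R}_{\textsc{ips}}(\pi, S)} = R(\pi)$.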

\subsection{Regularized Importance Weighting}\label{sec:regularizations}
The IPS estimator in \eqref{eq:ips_policy_value} is unbiased when $\pi_0(a|x)=0$ implies that $\pi(a|x)=0$ for all $(x, a) \in \cX \times \cA$. However, its variance scales linearly with the IWs \citep{swaminathan2017off} and can be large if the target policy $\pi$ differs significantly from the logging policy $\pi_0$. To mitigate this effect, it is common to transform the IWs using a regularization function that introduces some bias to reduce variance. Specifically, a regularized IPS estimator is defined as
\begin{align}\label{eq:reg_ips_policy_value}
    \hat{R}(\pi, S) &= \frac{1}{n} \sum_{i=1}^n \hat{w}(x_i, a_i) c_i \,,
\end{align}
where $\hat{w}(x, a)$ are the regularized IWs. Examples of $\hat{w}$ include clipping (\texttt{Clip}) \citep{london2019bayesian}, exponential smoothing (\texttt{ES}) \citep{aouali23a}, implicit exploration (\texttt{IX}) \citep{gabbianelli2023importance}, and harmonic (\texttt{Har}) \citep{metelli2021subgaussian}, defined as
\begin{talign}\label{eq:regs}
    \texttt{Clip}: \qquad &\hat{w}(x, a) = \frac{\pi(a \mid x)}{\max(\pi_0(a \mid x), \tau)}\,, \, \tau \in [0, 1]\,,\\
    \texttt{ES}: \qquad &\hat{w}(x, a) = \frac{\pi(a \mid x)}{\pi_0(a \mid x)^\alpha}\,, \, \alpha \in [0, 1]\,,\nonumber\\
    \texttt{IX}: \qquad &\hat{w}(x, a) = \frac{\pi(a \mid x)}{\pi_0(a \mid x) + \gamma}\,, \, \gamma \in [0, 1]\,,\nonumber\\
    \texttt{Har}: \qquad &\hat{w}(x, a) = \frac{{w}(x, a)}{(1-\lambda){w}(x, a) +\lambda}\,, \, \lambda \in [0, 1]\,.\nonumber
\end{talign}
All of these regularizations except \texttt{Har} are linear in $\pi$. Other non-linear regularizations have been proposed \citep{swaminathan2015batch, su2020doubly}, but we focus on the above examples because their hyperparameters all lie in the same range $[0, 1]$, which facilitates their comparison.
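
As a small worked illustration of \eqref{eq:regs}, consider a single context-action pair with $\pi(a | x) = 0.5$ and $\pi_0(a | x) = 0.01$, so that $w(x, a) = 50$. These values and the hyperparameters below are chosen purely for illustration and do not come from any experiment. Then
\begin{align*}
    \texttt{Clip}\ (\tau = 0.1): \qquad &\hat{w}(x, a) = \frac{0.5}{\max(0.01, 0.1)} = 5\,,\\
    \texttt{ES}\ (\alpha = 0.5): \qquad &\hat{w}(x, a) = \frac{0.5}{0.01^{0.5}} = 5\,,\\
    \texttt{IX}\ (\gamma = 0.1): \qquad &\hat{w}(x, a) = \frac{0.5}{0.01 + 0.1} \approx 4.55\,,\\
    \texttt{Har}\ (\lambda = 0.5): \qquad &\hat{w}(x, a) = \frac{50}{0.5 \cdot 50 + 0.5} \approx 1.96\,.
\end{align*}
In each case the regularized weight is far smaller than $w(x, a) = 50$, which is how these transformations trade some bias for reduced variance. Conversely, whenever $\pi_0(a | x) > 0$, setting $\tau = 0$, $\alpha = 1$, $\gamma = 0$, or $\lambda = 1$ in the respective formula recovers the unregularized weight $\hat{w}(x, a) = w(x, a)$.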