\maketitle

\begin{abstract}
    While significant progress has been made in designing algorithms that minimize regret in online decision-making, real-world scenarios often introduce additional complexities, with missing outcomes perhaps among the most challenging ones.
    Overlooking this aspect, or wrongly assuming that outcomes are missing at random, can lead to biased estimates of the rewards and may result in linear regret.
    Despite the practical relevance of this challenge, no rigorous methodology currently exists for systematically handling missingness, especially when the missingness mechanism is not random.
    In this paper, we address this gap in the context of multi-armed bandits (MAB) with missing outcomes by analyzing the impact of different missingness mechanisms on achievable regret bounds.
    We introduce algorithms that account for missingness under both missing at random (MAR) and missing not at random (MNAR) models.
    Through both analytical and simulation studies, we demonstrate the drastic improvements in decision-making by accounting for missingness in these settings.
\end{abstract}

\section{Introduction}
Multi-armed bandit (MAB) algorithms have emerged as indispensable tools for decision-making under uncertainty, balancing the trade-off between exploring different options and exploiting the best-known action.
These algorithms have achieved success in various domains ranging from personalized online advertisement and recommender systems \citep{li2010contextual, xu2020contextual, ban2024neural} to clinical trials \citep{villar2015multi, aziz2021multi, varatharajah2022contextual} and adaptive routing in communication systems \citep{maghsudi2016multi,li2020multi}.
For instance, in online advertising, advertisers need to continuously select which ad to show to a user to maximize click-through rates.
Similarly, in clinical trials, researchers must decide which treatment to administer to patients to optimize recovery rates.
MAB algorithms guide decision-makers in such scenarios to learn actions that minimize regret.
% over time.

Significant progress has been made in developing MAB algorithms that minimize regret in various settings \citep{lai1985asymptotically,Auer2002,bubeck2012regret,lattimore2020bandit,slivkins2019}.
However, the real world often introduces challenges that deviate from the assumptions of the classical MAB framework or its current extensions.
One of the most critical challenges is that of missing outcomes -- situations where the results of certain actions are not always observed.
This challenge arises more often than not in practice and can fundamentally undermine the decision-making process if left unaddressed. To illustrate this, consider an example of a large-scale clinical trial for a new cancer treatment.
Patients are randomly assigned to different treatment arms, and their health outcomes are monitored over time. In practice, not all patients will complete the trial. Some may drop out early due to side effects, while others may stop reporting outcomes for personal reasons, and some could pass away during the trial due to reasons not related to the treatment (competing events).
Crucially, the missingness of the outcome may not be random.
Patients experiencing severe side effects or poor health are more likely to drop out, meaning that the missingness mechanism is correlated with the unobserved outcome itself.
This introduces systematic bias into the estimation of the rewards, and if not accounted for, would lead to poor decision-making.

The issue of missingness is not confined to healthcare.
Consider a recommendation system that suggests articles to users on an online platform.
If users who find the content irrelevant are less likely to provide feedback (e.g., they leave the site without interacting), a system that treats the missing feedback as independent of user satisfaction will overestimate the value of the recommended articles.
Here too, missingness is correlated with the unobserved outcome, leading to biased reward estimates and sub-optimal recommendations.

The problem of missing data is a fundamental challenge in causal inference.
This issue has been extensively studied over the past decades, with seminal works \citep{rubin1976inference, little2019statistical, bang2005doubly} laying the foundation for correcting biased estimates in the presence of missing data.
These methods, along with more recent developments in graphical models for handling missing data \citep{mohan2021graphical, nabi2020full}, have become standard approaches in causal inference.
Missing data has also been extensively explored in specific contexts such as instrumental variables \citep{tchetgen2017general, sun2018semiparametric, kennedy2019handling}
and mediation analysis \citep{zhang2013methods, zhang2015mediation, kidd2023mediation}, among others.
By contrast, the challenge of missing outcomes has received relatively little attention in multi-armed bandit problems, although some progress has been made in related areas.
For instance, the problem of delayed feedback in bandits bears some similarity to our setting, as both involve incomplete information at decision time.
Several works have addressed stochastic bandits with unrestricted delays \citep{joulani2013online, vernade2017stochastic}, and delays dependent on stochastic rewards \citep{pike2018bandits, lancewicki2021stochastic}.
In contextual bandits, \citet{bouneffouf2017context} studied linear contextual bandits with missing (restricted) contexts.
While this work addresses missing data in bandits, it focuses on missing contexts rather than outcomes and assumes a linear reward model.
Others have explored bandit problems with variable costs or restricted observations.
For example, \citet{ding2013multi} and \citet{seldin2014prediction} studied MAB problems with variable costs, where the outcome is observable only after paying the associated cost.


There are two lines of research closely related to our work.
The first includes works such as
\citet{chen2022some} and \citet{bouneffouf2020contextual}, which consider the problem of MAB with missing outcomes.
The work of \citet{chen2022some} is limited to empirical considerations and suggestions, without formally studying the problem or providing tailored algorithms.
The approach of \citet{bouneffouf2020contextual} employs unsupervised learning techniques to impute the missing rewards in a contextual bandit setting.
Both of these works assume that the missingness mechanism is random, possibly after conditioning on the context.
In this paper, we conduct a thorough formal study, characterizing the best achievable regret bounds under multiple scenarios with missing outcomes.
We also provide novel regret lower bounds and algorithms that are guaranteed to achieve (near-)optimal regret.
The second line of related research concerns bandits with graph feedback \citep{mannor2011bandits}, where pulling an arm provides feedback about the rewards of the other arms connected to it in a given graph.
Typically, these models assume that each arm has a self-loop, ensuring its own reward is always observed \citep{mannor2011bandits, alon2017nonstochastic, li2020stochastic, cortes2020online, dai2024can}.
\citet{esposito2022learning} extended this framework by allowing for missing self-loops, aligning with the missing outcome setting.
However, their model assumes that missingness depends only on the chosen action, whereas we explicitly analyze cases where the missingness mechanism is outcome-dependent.
We obtain unbiased estimates of the expected rewards in this setting by using a mediator variable as auxiliary information.



Addressing the problem of missing outcomes is both practically relevant and theoretically challenging.
In applications such as healthcare, education, and e-commerce, accounting for missing data could lead to better treatment policies, more personalized learning experiences, and more effective product recommendations, potentially affecting millions of individuals.
In this paper, we undertake the first formal study of multi-armed bandits with missing outcomes and provide tailored algorithms that explicitly handle different types of missingness.
Our main contributions are two-fold.
First, we provide an analysis of the impact of missing outcomes on achievable regret (the loss of optimality).
Second, we introduce provably good upper confidence bound (UCB) algorithms that are tailored to handle both missing at random and missing not at random mechanisms.
Our algorithms are designed to adjust reward estimates based on the observed data and the missingness mechanism, ensuring unbiased estimation.
% Contributions and the consequences of our results go here..
Beyond these two contributions, we extend our analysis to settings where not only outcomes but also mediators (e.g., users providing feedback) are prone to missingness, to further broaden the applicability of our approach.

The remainder of this paper is structured as follows.
% We begin by reviewing the most relevant literature in \ref{subsec:related}.
In \Cref{sec:prb}, we review the relevant background and formalize the problem of multi-armed bandits with missing outcomes.
In \Cref{sec:missingoutcome}, we present our algorithms for the settings of missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR).
Additionally, we provide the corresponding achievable regret lower bounds.
The technical proofs are postponed to \Cref{apx:proofs} due to space limitations.
In \Cref{sec:missingmediator}, we extend our approach and present algorithms for the case when the mediator is also prone to missingness.
A discussion of the limitations of our work and our concluding remarks appear in \Cref{sec:conc}.

\section{Formalization and Problem Setup}\label{sec:prb}
We begin by reviewing the classic multi-armed bandit (MAB) setup and then extend it to incorporate missing outcomes.
The MAB problem involves an agent (decision-maker) who interacts with an environment over a sequence of $T$ time steps.
At each time step $t\in\{1,\dots,T\}$, the agent pulls an arm $a_t$ from a set of $n$ available actions indexed by $\mathcal{A}=\{1,\dots,n\}$.
Upon pulling this arm, the agent receives a stochastic reward $Y_t\in\mathcal{Y}$ drawn from a fixed (but unknown) probability distribution associated with arm $a_t$.
The goal of the agent is to minimize the \emph{cumulative regret} over the time horizon $T$, which is defined as the cumulative difference between the rewards of the optimal arm and the chosen arms.
Specifically, let $\mu_a \coloneqq \ex{Y \mid A=a}$ for every $a\in\mathcal{A}$.
The optimal arm, denoted by $a^*$, is the arm that maximizes the expected reward, i.e., 
\(a^*\coloneqq \argmax_{a\in\mathcal{A}}\mu_a.\)
The regret at time $t$ is  defined as
\(R_t \coloneqq \mu_{a^*}-\ex{Y\mid A=a_t}\),
and the cumulative regret over $T$ rounds, denoted by $R_T$, is the sum of these instantaneous regrets over the horizon $T$:
\begin{equation}
\begin{split}
    R_T &\coloneqq \sum_{t=1}^T(\mu_{a^*}-\ex{Y\mid A=a_t}).
    % \\&= T\mu_{a^*} - \sum_{t=1}^T \ex{Y\mid A=a_t}.
\end{split}
\end{equation}
In the classical setting, it is assumed that after pulling an arm $a_t$, the agent always observes the true reward $Y_t$ without any missingness.
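For intuition, the cumulative regret can equivalently be written in terms of how often each suboptimal arm is pulled.
Writing $\Delta_a \coloneqq \mu_{a^*}-\mu_a$ for the suboptimality gap of arm $a$ and $N_a(T)\coloneqq\sum_{t=1}^T\mathbbm{1}\{a_t=a\}$ for its pull count (both symbols are introduced here only for illustration), grouping the terms of the sum by arm gives
\[
    R_T = \sum_{a\in\mathcal{A}} \Delta_a\, N_a(T).
\]
Controlling the regret therefore amounts to controlling the number of pulls of each suboptimal arm, which in turn depends on how quickly each $\mu_a$ can be estimated.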
%However, in real-world applications, rewards may be missing, which complicates the learning process.

We extend the classic MAB model to accommodate missingness.
We assume that pulling an arm $a_t\in\mathcal{A}$ draws a stochastic tuple
$\big(Y_t, \oo_t, M_t, \mm_t \big)$ 
 from a fixed but unknown probability distribution associated with arm $a_t$. In this tuple,
$Y_t\in\mathcal{Y}$ represents the true reward (as before), whereas $\oo_t\in\{0,1\}$ is an indicator denoting whether this reward is observed. 
% $M_t\in\mathcal{M}$ is a possible mediator or an auxiliary variable\footnote{The introduction of this auxiliary variable is without loss of generality, as it can be simply set to $M\equiv 0$, i.e., a degenerate variable that carries no information.}, with $\mm_t\in\{0,1\}$ indicating whether this auxiliary variable is observed.
$M_t\in\mathcal{M}$ is a possible mediator\footnote{Our use of the term `mediator' is broader than in traditional causal inference. In this context, it refers to any auxiliary variable potentially correlated with the reward or the missingness mechanism, not necessarily one on a specific causal pathway. The inclusion of this variable is also without loss of generality, as it can be a degenerate variable ($M_t \equiv 0$) that carries no information.} or an auxiliary variable, with $\mm_t\in\{0,1\}$ indicating whether this auxiliary variable is observed.

For example, in online recommendations, auxiliary information could include metrics such as the time a user spends on a webpage before navigating away, or other data points gathered from browser cookies, such as past browsing behavior, device type, or location.
The agent has access to the `observed' tuple $\big(Y^o_t, \oo_t, M^o_t, \mm_t \big)$,
where the observed values $Y^o_t$ and $M^o_t$ are defined as follows:
\begin{equation}\label{eq:consistency}
\begin{split}
    Y^o_t=\begin{cases}
        Y_t; &\text{ if } \oo_t =1,\\
        ?; &\text{ o.w.}
    \end{cases},
    M^o_t=\begin{cases}
        M_t; &\text{ if } \mm_t =1,\\
        ?; &\text{ o.w.}
    \end{cases},
\end{split}
\end{equation}
where $?$ denotes a missing value.
We define $\mu_{a}$ as the expected value of $Y_t$ given $A_t=a$ as before, with the crucial difference that samples of $Y_t$ are missing when $\oo_t=0$.

Clearly, without imposing further structure, it is not possible to construct unbiased estimators for the expected rewards of each arm. 
In fact, these expectations are not `identifiable,' meaning that they are not uniquely determinable functionals of the probability measure over observable variables.
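As a small worked illustration (the numbers are hypothetical and chosen only for concreteness), suppose $\mathcal{Y}=\{0,1\}$ and, for some arm $a$, $\mathbb{P}(\oo_t=1\mid A_t=a)=\tfrac{1}{2}$ and $\ex{Y_t\mid A_t=a, \oo_t=1}=\tfrac{1}{2}$.
By the law of total expectation,
\[
    \mu_a = \tfrac{1}{2}\cdot\tfrac{1}{2} + \tfrac{1}{2}\,\ex{Y_t\mid A_t=a, \oo_t=0}
    = \tfrac{1}{4} + \tfrac{1}{2}\,\ex{Y_t\mid A_t=a, \oo_t=0}.
\]
The observed data carry no information about the last term, which can take any value in $[0,1]$.
Every value $\mu_a\in[\tfrac{1}{4},\tfrac{3}{4}]$ is therefore compatible with the same distribution over observable variables.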
% \begin{remark}Clearly, without imposing further structure, it is not possible to construct unbiased estimators of the expected rewards.
% In particular, even with infinitely many samples, i.e., with known densities $P(Y^o_t\mid A_t=a)$ for every $a\in\mathcal{A}$, the following bounds are tight for the expected rewards $\ex{Y_t\mid A_t=a}$ in general:
%     \[
%     \begin{split}
%         \ex{Y^o_t\mid A_t=a, \oo_t=1}& P(\oo_t=1) \\&\leq \ex{Y_t\mid A_t=a} \leq\\ \ex{Y^o_t\mid A_t=a, \oo_t=1}&P(\oo_t=1) + \sup_{y\in\mathcal{Y}}yP(\oo_t=0),
%     \end{split}
%     \]
% \end{remark}
In what follows, we begin with the case where the mediator is fully observed ($\mm_t=1$ with probability $1$).
We first consider the case where the missingness mechanism of the outcome is independent of everything else.
% to convey the key features and ideas of our extension of MAB framework. 
Subsequently, we analyze the more realistic cases where this missingness is correlated with the missing outcome $Y_t$.
Later, in \Cref{sec:missingmediator}, we extend our findings to the case where even the mediator is prone to missingness.

\section{MAB with Missing Outcome}\label{sec:missingoutcome}
Throughout, we assume that the outcomes are not `always missing.'
\begin{assumption}[Positivity]\label{as:pos}
    % There is a positive constant $\gamma$ such that f
    For every action $a\in\mathcal{A}$ and mediator $m\in\mathcal{M}$,
    \(\mathbb{P}\big(\oo_t=1\mid M_t=m, A_t=a\big)> 0\).
    Moreover, $\mathbb{P}(M_t \mid A_t)$ is positive everywhere\footnote{With sufficient caution, the second part of this assumption could be omitted. 
    However, we include it here for the sake of simplicity in the presentation.}.
\end{assumption}
% \begin{remark}
%     An immediate corollary of Assumption \ref{as:pos} is that   \(\mathbb{P}(\oo_t=1\mid A_t=a)> 0\)  for every action $a$.
% \end{remark}
Assumption~\ref{as:pos} is reasonable, as otherwise there would exist an action-mediator pair for which the agent never observes a reward sample.
For the rest of this section, we assume that the auxiliary variable $M_t$ is always observed.
% , hence $\mm_t$ is degenerate.


% \subsection{Main Ideas Behind Our Algorithms and Results}
% In this section, we outline the core ideas behind our upper bound algorithms and provide an overview of their proofs to offer clarity and intuition. The main idea of our algorithms is derived from the classic UCB (Upper Confidence Bound) algorithm, which aims to establish an upper bound on the estimation error of \( \mathbb{E}[Y \mid A = a] \) for each arm \( a \) at each time step \( t \). We present the following lemma to explicitly demonstrate the main idea behind our approach:

% % \begin{lemma}
    
% % \end{lemma}


\subsection{Missing Completely At Random ({{\small MCAR}})}
We begin with the case where the outcome missingness mechanism is independent of the other variables (including the outcome itself).
This case is studied for the sake of completeness, and we acknowledge that, unlike the other cases to follow, it can be accommodated by most existing approaches.
\input{figures/Missing_Mechansim}
\begin{assumption}[MCAR]\label{as:mcar}
    The outcome is missing completely at random.
    That is, \(\oo_t\indep(A_t, Y_t, M_t )\) for $t\in\{1,\dots,T\}$.
\end{assumption}
This assumption holds, for instance, when data are erased by an independent mechanism, such as a power outage.
The graph of Figure~\ref{fig:mcar} represents this missingness mechanism, whereby the missingness indicator $\oo$ is an isolated node.
As there is no information conveyed by the missingness indicator, the missing chunk of the data can be discarded without any need for extra care.
As such, the classic upper confidence bound (UCB) algorithms are expected to achieve (near-)optimal regret.
We formalize these claims through the next two theorems.
For the sake of completeness, we have included the adapted UCB algorithm (Alg.~\ref{alg:mcar_algorithm}) for this scenario in Appendix \ref{apx:alg}.
Let \( \gamma = \mathbb{P}(O_t^Y = 1) \) be the probability of observing the outcome in each round.
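A brief sketch of why discarding the missing rounds is harmless, and of where the factor $1/\gamma$ in the bounds below comes from, is as follows (this is a heuristic reading, not the formal argument of Appendix~\ref{apx:proofs}).
By consistency (Equation~\eqref{eq:consistency}) and Assumption~\ref{as:mcar},
\[
    \ex{Y^o_t\mid A_t=a, \oo_t=1} = \ex{Y_t\mid A_t=a, \oo_t=1} = \ex{Y_t\mid A_t=a} = \mu_a,
\]
so the empirical mean of the observed rewards of arm $a$ is unbiased for $\mu_a$.
Moreover, out of $N$ pulls of an arm, the number of observed rewards is binomial with mean $\gamma N$.
Roughly speaking, each confidence interval is thus built from about $\gamma N$ rather than $N$ samples, which widens it by a factor of about $1/\sqrt{\gamma}$ and is consistent with the inflation of the classical $\sqrt{nT\log T}$ rate by the same factor.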



% \input{algorithms/mcar_algorithm}


% The following theorem presents a regret bound for this mechanism. We assume that $\mathbb{E}[Y] \in [0, 1]$ and $Y$ is sub-gaussian and $P(R_Y = 1) \geq \gamma$.

\begin{restatable}{theorem}{theomcarupper} \label{theo:mcar_upper}\text{(MCAR regret guarantee)} Under Assumption~\ref{as:mcar}, for every \( \alpha > 1 \), the cumulative regret of the adapted UCB (Alg.~\ref{alg:mcar_algorithm}) is bounded as follows:
\[
\mathbb{E}[R_T] = O\left(\sqrt{\frac{ \alpha n T \log(T)}{\gamma}}\right).
\]
\end{restatable}

The proof of \Cref{theo:mcar_upper}, which provides a regret bound similar to that of the classic UCB algorithm, but adapted to our setting, is included in Appendix \ref{apx:proofs}.
The following result indicates that this regret bound is (near-)optimal.
\begin{restatable}{theorem}{theomcarlower}\label{theo:mcar_lower}
\text{(Minimax lower bound for MCAR)}  
For any policy \( \pi \), there exists an MCAR instance \( \nu \) s.t.
\[
\mathbb{E}[R_T(\pi, \nu)] = \Omega\left( \sqrt{\frac{nT}{\gamma}} \right),
\]
where \( \mathbb{E}[R_T(\pi, \nu)] \) represents the expected regret of policy \( \pi \) in instance \( \nu \).
\end{restatable}


See \Cref{apx:proofs} for the proof of \Cref{theo:mcar_lower} as well as the rest of the results of this paper.
\subsection{Missing At Random (MAR)}\label{subsec:mar}
We now focus on more realistic settings where the missingness mechanism provides information about the missing outcomes. 
This is the case, for instance, when unsatisfied customers are more likely to leave comments on an online platform, or when, in health applications, patients with severe side effects are more likely to drop out of a study.
We first consider the case when missingness is at random, i.e., independent of $Y$ given $M$ and $A$. 
The graphs of Figure~\ref{fig:mar1} and Figure~\ref{fig:mar2} illustrate two possible representations of the MAR mechanism, under which Assumption~\ref{ass:mar_assumption} holds.

\begin{assumption}[MAR]\label{ass:mar_assumption}
\(\oo_t\indep Y_t \mid (A_t, M_t )\) for $t\in\{1,\dots,T\}$.
\end{assumption}
% \ref{ass:mar_assumption} states that the missingness in the outcome is random after conditioning on the action and the mediator (auxiliary) variable.
Under \Cref{ass:mar_assumption}, 
% if $|\mathcal{M}| = k$, 
the expected reward is identifiable as follows:
% \begin{equation}\label{eq:idmar}
%     \begin{split}
%         \mu_a&=\mathbb{E}[Y_t\mid A_t=a] = \sum_{m \in \mathcal{M}} \mathbb{E}[Y_t \mid m, a] p_{m,a}\\
%         &\overset{(a)}{=} \sum_{m \in \mathcal{M}}  \mathbb{E}[Y_t \mid m, a, \oo_t = 1]p_{m,a}\\
%         &\overset{(b)}{=} \sum_{m \in \mathcal{M}}  \mathbb{E}[Y^o_t \mid m, a, \oo_t = 1]p_{m,a}\\
%         &=\mathbb{E}\big[\mathbb{E}[Y^o_t \mid m, a, \oo_t = 1]\mid A_t=a\big],
%     \end{split}
% \end{equation}
\begin{equation}\label{eq:idmar}
    \begin{split}
        \mu_a=\mathbb{E}[Y_t&\mid A_t=a] = \mathbb{E}\big[ \mathbb{E}[Y_t \mid M, a] \mid A_t=a\big]\\
        &\overset{(a)}{=} \mathbb{E}\big[  \mathbb{E}[Y_t \mid M, a, \oo_t = 1]\mid A_t=a\big]\\
        &\overset{(b)}{=} \mathbb{E}\big[  \mathbb{E}[Y^o_t \mid M, a, \oo_t = 1]\mid A_t=a\big],
    \end{split}
\end{equation}
where $(a)$ follows from Assumption~\ref{ass:mar_assumption} and $(b)$ holds due to consistency (see Equation~\eqref{eq:consistency}).
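
To illustrate Equation~\eqref{eq:idmar} with hypothetical numbers, let $\mathcal{M}=\{0,1\}$ and fix an arm $a$ with $\mathbb{P}(M_t=1\mid A_t=a)=\tfrac{1}{2}$, conditional means $\ex{Y_t\mid M_t=1, a}=0.8$ and $\ex{Y_t\mid M_t=0, a}=0.2$, and observation probabilities $\mathbb{P}(\oo_t=1\mid M_t=1, a)=0.9$ and $\mathbb{P}(\oo_t=1\mid M_t=0, a)=0.1$.
The true mean is $\mu_a = 0.5\cdot 0.8 + 0.5\cdot 0.2 = 0.5$, which is exactly what the right-hand side of Equation~\eqref{eq:idmar} returns under Assumption~\ref{ass:mar_assumption}.
By contrast, the complete-case mean that ignores the mediator converges to
\[
    \ex{Y^o_t\mid A_t=a, \oo_t=1}
    = \frac{0.5\cdot 0.9\cdot 0.8 + 0.5\cdot 0.1\cdot 0.2}{0.5\cdot 0.9 + 0.5\cdot 0.1}
    = \frac{0.37}{0.5} = 0.74,
\]
which overstates $\mu_a$ because the high-reward stratum $M_t=1$ is observed far more often than the low-reward stratum $M_t=0$.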

Accordingly, we will use the following estimator for $\mu_a$:
\begin{equation}\label{eq:marest}\hat{\mu}_a = \frac{1}{\vert T_a\vert}\sum_{t\in T_a}\sum_{m\in\mathcal{M}} \mathbbm{1}\{M_{t}=m\}\bigg(\frac{1}{\vert T_{m,a,o}\vert}\sum_{t'\in T_{m,a,o}} Y^o_{t'}\bigg),\end{equation}
% The estimator based on \ref{eq:idmar} can be constructed by first estimating the observed outcome expectations $\ex{Y_t^o\mid m,a, \oo_t=1}$ and then taking their empirical mean.
where $T_{a}, T_{m,a,o}\subseteq\{1,\dots,T\}$ are the sets of iterations where $A_t=a$, and iterations where $A_t=a$, $M_t=m$ and $\oo_t=1$, respectively.
In what follows, for brevity, we define $p_{m,a}\coloneqq\mathbb{P}(M_t=m\mid A_t=a)$.
% To make the general case clear, we first present an algorithm for minimizing regret when the conditional probabilities $p_{m,a}$ are known.
To build intuition for the general case, we first consider the simplified theoretical setting where the conditional probabilities $p_{m,a}$ are known.
We then adapt our algorithm to the case where these probabilities are unknown.
Recall that \( n=\vert\mathcal{A}\vert \) is the number of arms. 
We assume that \( \mathbb{E}[Y_t \mid m, a] \in [0, 1] \) for all arms and that the reward \( Y_t \) is sub-Gaussian.
% , and \( \vert\mathcal{M}\vert = K \).
% , and that the agent knows the probabilities \( p_{m,a} = \mathbb{P}(M = m \mid a) \) for all \( m \in [K] \) and \( a \in [n] \).
\Cref{alg:mar_algorithm} presents the pseudo-code for the first case.
The algorithm is based on UCB, but with an initial step where the agent pulls each arm \( \log(T)^2 \) times.
% at the beginning of the algorithm.
% This is to ensure that every arm is explored  
In the subsequent rounds, both the expected rewards and the associated confidence bounds are estimated based on \Cref{eq:marest}.
% Then, at each subsequent round, the agent applies a modified version of the Upper Confidence Bound (UCB) algorithm, based on Hoeffding's inequality.
In order to present the regret bounds, we need the following definitions.
% The following theorem presents the regret bound for this algorithm. 
Let \( P_a = \sum_{m \in \mathcal{M}} \frac{p_{m,a}}{\gamma_{m,a}} \) where \( \gamma_{m, a} = \mathbb{P}(O^Y = 1 \mid m, a)\).
Further, define \( S\) and $H$ as the arithmetic mean and the harmonic mean of the \( P_a \) values, respectively: 
\[ S \coloneqq \frac{1}{\vert \mathcal{A}\vert} \sum\limits_{a\in\mathcal{A}} P_a,\quad H\coloneqq \frac{\vert \mathcal{A}\vert}{\sum_{a\in\mathcal{A}}\frac{1}{P_a}}. \]
\begin{restatable}{theorem}{theomarupperfirst} \label{theo:mar_upper_1}
\text{(Regret guarantee for Alg.~\ref{alg:mar_algorithm})} Under Assumption \ref{ass:mar_assumption},
for every \( \alpha > 1 \), there exists a constant $c$ such that the following regret bound holds for $T \ge c$:

\[
\mathbb{E}[R_T] = O\left( \sqrt{\alpha T \log(T) n S} \right).
\]
\end{restatable}

Next, we show that \Cref{alg:mar_algorithm} can be adapted to the case where the conditional probabilities $p_{m,a}$ are not known and must be estimated -- see Algorithm~\ref{alg:mar_algorithm2}.
% for details.
% Appendix \ref{apx:alg}), which operates under the same assumptions as before, except that the agent does not know the probabilities \( p_{m,a} = \mathbb{P}(m \mid a) \).
% The algorithm is similar to the previous one, with a slightly modified UCB.
The following theorem shows that this algorithm achieves the same regret bound as \Cref{alg:mar_algorithm}, i.e., the estimation of $p_{m,a}$ does not affect the cumulative regret.

\begin{restatable}{theorem}{theomaruppersecond} \label{theo:mar_upper_2}
\text{(Regret guarantee for Alg.~\ref{alg:mar_algorithm2})}  
Under Assumption~\ref{ass:mar_assumption}, for every \( \alpha > 1 \), there exists a constant $c$ such that the following regret bound holds for $T \ge c$:
\[
\mathbb{E}[R_T] = O\left( \sqrt{\alpha T \log(T) n S} \right).
\]
\end{restatable}
% The proofs for Theorem~\ref{theo:mar_upper_1} and Theorem~\ref{theo:mar_upper_2} are provided in Appendix~\ref{apx:proofs}.
% \begin{remark}

 Since $P_a$ is a weighted average of $1/\gamma_{m, a}$ over $m \in \mathcal{M}$ (with weights $p_{m, a}$), the regret bounds depend not on the cardinality of the mediator set, $|\mathcal{M}|$, but rather on the heterogeneity of the $\gamma_{m, a}$ values.
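For a concrete sense of scale (with hypothetical numbers), take a binary mediator with $p_{0,a}=p_{1,a}=\tfrac{1}{2}$, $\gamma_{1,a}=0.9$, and $\gamma_{0,a}=0.1$.
Then
\[
    P_a = \frac{0.5}{0.9} + \frac{0.5}{0.1} = \frac{50}{9} \approx 5.56.
\]
More generally, since the weights $p_{m,a}$ sum to one,
\[
    \frac{1}{\max_{m}\gamma_{m,a}} \;\le\; P_a \;\le\; \frac{1}{\min_{m}\gamma_{m,a}},
\]
and Jensen's inequality applied to the convex map $x\mapsto 1/x$ gives $P_a \ge 1/\mathbb{P}(\oo_t=1\mid A_t=a)$, with equality when $\gamma_{m,a}$ does not vary with $m$.
In the numerical example, the marginal observation probability is $0.5$, so the Jensen bound is $2$, well below $P_a\approx 5.56$.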
    % Moreover, with \( \gamma_\mathrm{min} = \min\limits_{m, a} \gamma_{m, a} \), \ref{theo:mar_upper_2} implies the following bound:
    % \[
    % P_a \leq \frac{1}{\gamma_\mathrm{min}}, \quad \mathbb{E}[R_T] = O\left( \sqrt{\frac{\alpha T \log(T) n}{\gamma_\mathrm{min}}} \right).
    % \]
% \end{remark}
%%%%%%%%%%%%%%%%%%%%%%
The following theorem provides the minimax lower bound, demonstrating near-optimality of Algorithms \ref{alg:mar_algorithm} and \ref{alg:mar_algorithm2}.

\begin{restatable}{theorem}{theomarlower} \label{theo:mar_lower_1}
\text{(Minimax lower bound for MAR)}  
For any policy $\pi$, there exists a MAR instance $\nu$ such that:
\[
\mathbb{E}[R_T(\pi, \nu)] = \Omega\left( \sqrt{T n H}\right).
\]
% where $H$ is the harmonic mean of $P_a$s: $H=\frac{n}{\sum\limits_{a} \frac{1}{P_a}}$.
\end{restatable}
Note that when $\gamma_{m,a}$ values are identical (and equal to $\gamma$) then $S$ and $H$ coincide.
Further, 
% Furthermore, if \( \gamma_{m, a} = \gamma \) for all \( m, a \), applying Theorem~\ref{theo:mar_lower_1} gives the following lower bound, showing that the upper and lower bounds match:
% \[
% \mathbb{E}[R_T(\pi, \nu)] = \Omega\left( \sqrt{\frac{Tn}{\gamma}} \right).
% \]
the upper and lower bounds in this case match those of MCAR.
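The coincidence of $S$ and $H$ is a one-line computation.
If $\gamma_{m,a}=\gamma$ for all $m$ and $a$, then $P_a=\sum_{m\in\mathcal{M}} p_{m,a}/\gamma = 1/\gamma$ for every arm, so
\[
    S = \frac{1}{n}\sum_{a\in\mathcal{A}} \frac{1}{\gamma} = \frac{1}{\gamma},
    \qquad
    H = \frac{n}{\sum_{a\in\mathcal{A}} \gamma} = \frac{1}{\gamma}.
\]
Substituting into Theorems~\ref{theo:mar_upper_2} and \ref{theo:mar_lower_1} gives $O\big(\sqrt{\alpha T\log(T)\, n/\gamma}\big)$ and $\Omega\big(\sqrt{Tn/\gamma}\big)$, which are the rates of Theorems~\ref{theo:mcar_upper} and \ref{theo:mcar_lower}.
In general, the arithmetic-harmonic mean inequality gives $H\le S$, with equality exactly when all $P_a$ are equal; beyond the logarithmic factor, the gap between the upper and lower bounds thus reflects how much $P_a$ varies across arms.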
% -- see Theorems \ref{theo:mcar_upper} and \ref{theo:mcar_lower}.

A special case of the MAR environment (depicted in Figure~\ref{fig:semi_mcar}) pertains to when there is no mediator. In this case,  \Cref{ass:mar_assumption} reduces to the following:
\begin{assumption}\label{as:mar2}
    $\oo_t\indep Y_t\mid A_t$ for all $t\in\{1,\dots,T\}$.
\end{assumption}
% This can also be thought of as a generalization of the MCAR environment that does not involve a mediator.
Theorems~\ref{theo:mar_upper_2} and \ref{theo:mar_lower_1}
with a degenerate mediator ($\vert\mathcal{M}\vert=1$) imply the following corollary.
\begin{corollary}
    Under \Cref{as:mar2}, \Cref{alg:mar_algorithm2} induces cumulative regret
    \(
        \mathbb{E}[R_T] = O\left( \sqrt{\alpha T \log(T) n S} \right)
    \)
    and the cumulative regret of any policy is lower bounded by
    \(
        \mathbb{E}[R_T] = \Omega\left( \sqrt{T n H} \right),
    \)
    where \( S = \frac{\sum\limits_{a} \frac{1}{\gamma_a}}{n} \) and \( H = \frac{n}{\sum\limits_{a} \gamma_a} \).
\end{corollary}


\input{figures/semi_MCAR}

\paragraph{Discussion 1.}\label{discussion1} We used estimators that explicitly use the mediator values in this section.
As we pointed out earlier, the size of $\mathcal{M}$ (the alphabet of $M$) does not affect the regret bounds.
However, one might wonder whether the use of the mediator can be avoided, resulting in simpler algorithms and/or estimation schemes.
We show next that any such algorithm can induce linear regret in the worst case.
As a corollary, this result implies that naïvely employing the classical UCB algorithm also induces linear regret.
\begin{restatable}{theorem}{thmignoremed}\label{thm:ignoremed}
    For any mediator-agnostic policy $\pi$ (a policy that does not have access to mediator values), there exists a MAR instance $\nu$ satisfying \Cref{ass:mar_assumption} on which the regret of $\pi$ grows linearly:
    \[
        \ex{R_T(\pi,\nu)}=\Omega(T).
    \]
\end{restatable}
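The following two-instance construction conveys the intuition; it is an illustrative sketch with hypothetical numbers and is not necessarily the construction used in the formal proof.
Consider two arms with binary rewards and a binary mediator satisfying $\mathbb{P}(M_t=1\mid A_t=a)=\tfrac{1}{2}$ for both arms.
In both instances, arm $2$ yields a $\mathrm{Bernoulli}(0.62)$ reward that is independent of $M_t$ and always observed.
For arm $1$, with $\gamma_{m,1}=\mathbb{P}(\oo_t=1\mid M_t=m, A_t=1)$, let
\[
\begin{split}
    \nu:\quad & \ex{Y_t\mid M_t=1, A_t=1}=0.8,\ \ \ex{Y_t\mid M_t=0, A_t=1}=0.2,\\
    & \gamma_{1,1}=0.9,\ \ \gamma_{0,1}=0.1,\\
    \nu':\quad & \ex{Y_t\mid M_t=m, A_t=1}=0.74,\ \ \gamma_{m,1}=0.5 \ \text{ for } m\in\{0,1\}.
\end{split}
\]
Both instances satisfy \Cref{ass:mar_assumption} provided the missingness depends only on $(A_t, M_t)$.
In both, $\mathbb{P}(\oo_t=1\mid A_t=1)=0.5$ and $\ex{Y^o_t\mid A_t=1, \oo_t=1}=0.74$, so the law of $(\oo_t, Y^o_t)$ given each arm is identical across the two instances.
Yet $\mu_1=0.5<0.62$ under $\nu$, whereas $\mu_1=0.74>0.62$ under $\nu'$.
A mediator-agnostic policy therefore pulls arm $1$ the same expected number of times, say $\bar{N}_1$, in both instances, incurring expected regret $0.12\,\bar{N}_1$ under $\nu$ and $0.12\,(T-\bar{N}_1)$ under $\nu'$; the larger of the two is at least $0.06\,T$.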

\paragraph{Discussion 2.}\label{discussion2} The expected reward $\mu_a$ can also be estimated using a Horvitz-Thompson (HT) type estimator \citep{HorvitzThompson1952}.
Specifically, the conditional expectation terms in \Cref{eq:idmar} can be expressed as follows:
\[
\begin{split}
    &\ex{Y^o_t\mid m,a, \oo_t=1}=\\
    &\qquad
    \ex{\frac{Y^o_t \mathbbm{1}\{M_t=m, \oo_t=1\}}{p_{m,a}\gamma_{m,a}}\mid A_t=a},
\end{split}
\]
and plugging this identity into Eq.~\eqref{eq:idmar} yields
\begin{equation}\label{eq:ht}
    \mu_a = \ex{\sum_{m\in\mathcal{M}}
        \frac{Y^o_t\mathbbm{1}\{M_t=m, \oo_t=1\}}{\gamma_{m,a}}
        \mid A_t=a
    }.
\end{equation}
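For completeness, the identity preceding Equation~\eqref{eq:ht} can be verified directly.
Conditioning on the event $\{M_t=m, \oo_t=1\}$,
\[
\begin{split}
    &\ex{Y^o_t\mathbbm{1}\{M_t=m, \oo_t=1\}\mid A_t=a}\\
    &\quad= \mathbb{P}(M_t=m, \oo_t=1\mid A_t=a)\,\ex{Y^o_t\mid m, a, \oo_t=1}\\
    &\quad= p_{m,a}\,\gamma_{m,a}\,\ex{Y^o_t\mid m, a, \oo_t=1},
\end{split}
\]
and dividing by $p_{m,a}\gamma_{m,a}$, which is positive by Assumption~\ref{as:pos}, gives the claim.
Multiplying by $p_{m,a}$ and summing over $m$, as in Equation~\eqref{eq:idmar}, cancels the factor $p_{m,a}$, which is why only $\gamma_{m,a}$ appears in Equation~\eqref{eq:ht}.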

An estimator based on the latter does not require estimating the conditional outcome means (in contrast to Eq.~\eqref{eq:idmar}); instead, it requires estimates of the observation probabilities $\gamma_{m,a}$.
Using such an estimator is particularly beneficial when these probabilities are known in advance, or when a parametric model for the missingness mechanism can be justified.
However, if the observation probabilities $\gamma_{m,a}$ are small or estimated imprecisely, the HT estimator can exhibit high variance, leading to instability.
One can take a step further and construct augmented inverse propensity weighted (AIPW) estimators for $\mu_a$:
\[
\begin{split} \label{eq:dr}
    \mu_a &= \mathbb{E}\big[
        \sum_{m\in\mathcal{M}}\frac{\mathbbm{1}\{M_t=m\}}{\gamma_{m,a}}\big(
        Y^o_t\mathbbm{1}\{\oo_t=1\}
        -\\&
        (\mathbbm{1}\{\oo_t=1\}-\gamma_{m,a})\ex{Y^o_t\mid m, a, \oo_t=1}
        \big)
        \mid A_t=a
    \big],
\end{split}
\]
which is doubly robust (DR) in the sense that it is consistent as long as either the observation probabilities $\gamma_{m,a}$ or the conditional outcome means $\ex{Y^o_t\mid m, a, \oo_t=1}$ (but not necessarily both) are consistently estimated.
We prove this claim formally in Appendix \ref{apx:proofs} for the sake of completeness.
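A short population-level sketch conveys the reason (the formal statement is the one in the appendix).
Write $\mu_{m,a}\coloneqq\ex{Y^o_t\mid m, a, \oo_t=1}$, and let $\tilde{\gamma}_{m,a}>0$ and $\tilde{\mu}_{m,a}$ denote working values used in place of $\gamma_{m,a}$ and $\mu_{m,a}$; these three symbols are introduced here only for illustration.
Taking the conditional expectation term by term,
\[
\begin{split}
    &\mathbb{E}\big[
        \sum_{m\in\mathcal{M}}\frac{\mathbbm{1}\{M_t=m\}}{\tilde{\gamma}_{m,a}}\big(
        Y^o_t\mathbbm{1}\{\oo_t=1\}
        -(\mathbbm{1}\{\oo_t=1\}-\tilde{\gamma}_{m,a})\tilde{\mu}_{m,a}
        \big)
        \mid A_t=a
    \big]\\
    &\quad= \sum_{m\in\mathcal{M}} p_{m,a}\Big(\tilde{\mu}_{m,a} + \frac{\gamma_{m,a}}{\tilde{\gamma}_{m,a}}\big(\mu_{m,a}-\tilde{\mu}_{m,a}\big)\Big).
\end{split}
\]
If $\tilde{\mu}_{m,a}=\mu_{m,a}$, the correction term vanishes for any $\tilde{\gamma}_{m,a}$; if $\tilde{\gamma}_{m,a}=\gamma_{m,a}$, the bracket equals $\mu_{m,a}$ for any $\tilde{\mu}_{m,a}$.
In both cases the right-hand side reduces to $\sum_{m\in\mathcal{M}} p_{m,a}\mu_{m,a}=\mu_a$ by Equation~\eqref{eq:idmar}.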
In this paper, we consider discrete-valued mediators, and estimate all the quantities of interest through empirical means. 
Therefore, all three estimators (outcome-based, HT, and DR) coincide.
However, the HT and DR estimators can prove beneficial for extending our approach to incorporate continuous mediators, or in problems with high-dimensional actions and/or mediators where (semi)parametric models can help improve estimation efficiency.
% Despite several appealing asymptotic properties of these estimators, they may underperform compared to the estimator of Eq.~\ref{eq:idmar} when applied to finite samples \citep{}.
% Therefore, in this manuscript, we include the IPW and AIPW estimators solely for empirical evaluations.
\subsection{Missing Not At Random (MNAR)}\label{sec:mnar}
Finally, we consider the case where the missingness mechanism directly depends on the outcome value \( Y \). 
% In this setting, missingness is independent of \( M \) when conditioned on \( Y \) and \( A \). Figure~\ref{fig:mnar} illustrates the MNAR mechanism, which follows Assumption~\ref{ass:mnar_assumption}.
Here, we follow the identification strategy of \citet{zuo2024mediation} for MNAR.
However, we are interested only in identifying the expected rewards, rather than conducting mediation analysis.
We begin with the following assumption.
\begin{assumption}[MNAR]\label{ass:mnar_assumption}
\(
    \oo_t \indep M_t \mid (A_t, Y_t)
    \) for \( t \in \{1, \dots, T\} \).
\end{assumption}
In other words, the missingness is independent of the mediator when conditioned on the action and the actual outcome.
\Cref{fig:mnar} graphically represents this scenario.
% \ref{ass:mnar_assumption} states that the missingness in the outcome is random after conditioning on the action and outcome. 
This situation commonly arises in environments where the reward is missing due to its value. 
For example, if the outcome of interest is the income of an individual, they may not be inclined to report it if the value is too high or too low.

% Following the approach of \citep{zuo2024mediation}, we further make the following assumption.
We further make the following assumption, which is the minimal assumption required for identifiability.
\begin{assumption}[Completeness]\label{as:complete}
    The distribution \( \mathbb{P}(M, Y, O^Y = 1 \mid a) \) is complete in \( M \), that is, for any $a\in\mathcal{A}$, and for any function $g:\mathcal{Y}\to\mathbb{R}$,
    % denoting \( \mathbb{P}(M, Y, O^Y = 1 \mid A) = f(Y, M) \), the equation
    \[
    \int_{y \in \mathcal{Y}} \mathbb{P}(M, Y=y, O^Y = 1 \mid a)g(y) \, dy = 0
    \]
    implies that \( g(Y) = 0 \) with probability one.
\end{assumption}

% A corollary of Assumption~\ref{as:complete} is that the equation
% \(
% \int_{y \in \mathcal{Y}} f(Y, M)g(Y) \, dy = C_M
% \)
% also has a unique solution.


% We prove that \( \mathbb{E}[Y(a)] \) is identifiable if \( \mathbb{P}(M, Y, O^Y = 1) \) is complete (in the sense of \citep{zuo2024mediation}) in \( M \).
Below, we show how $\mu_a$ is identified under these two assumptions.
% First,
% \begin{align*}
%     &\mathbb{P}(m, y, O^Y = 0 \mid a) 
%     % = \mathbb{P}(O^Y = 0 \mid a, m, y) \mathbb{P}(m, y \mid a) 
%     \\
%     &\overset{(a)}{=} \mathbb{P}(m, y, O^Y = 1 \mid a) \times \frac{\mathbb{P}(O^Y = 0 \mid y,a,m)}{\mathbb{P}(O^Y = 1 \mid y,a,m)}\\
%     % &= \mathbb{P}(O^Y = 0 \mid a, y) \times \frac{\mathbb{P}(m, y, O^Y = 1 \mid a)}{\mathbb{P}(O^Y = 1 \mid a, m, y)} \\
%     &\overset{(b)}{=} \mathbb{P}(m, y, O^Y = 1 \mid a) \times \frac{\mathbb{P}(O^Y = 0 \mid y,a)}{\mathbb{P}(O^Y = 1 \mid y,a)},
% \end{align*}
% 
% Thus:
\begin{align*}
    &\mathbb{P}(m, O^Y = 0 \mid a) 
    = \int_{y \in \mathcal{Y}} \mathbb{P}(m, y, O^Y = 0 \mid a) \, dy \\
    &\overset{(a)}{=} \int_{y \in \mathcal{Y}} \mathbb{P}(m, y, O^Y = 1 \mid a) \frac{\mathbb{P}(O^Y = 0 \mid y, a,m)}{\mathbb{P}(O^Y = 1 \mid y, a,m)} \, dy\\
    &\overset{(b)}{=} \int_{y \in \mathcal{Y}} \mathbb{P}(m, y, O^Y = 1 \mid a) \frac{\mathbb{P}(O^Y = 0 \mid y, a)}{\mathbb{P}(O^Y = 1 \mid y, a)} \, dy,
\end{align*}
where $(a)$ and $(b)$ follow from Bayes' rule and \Cref{ass:mnar_assumption}, respectively.
Since \( \mathbb{P}(M, Y, O^Y = 1 \mid a) \) is complete in \( M \), solving this integral equation uniquely determines the inverse odds ratio \( \mathrm{OR}_{y,a}=\frac{\mathbb{P}(O^Y = 0 \mid y, a)}{\mathbb{P}(O^Y = 1 \mid y, a)} \), allowing us to identify \( \mathbb{P}(y\mid a) \) as follows:
\begin{equation}\label{eq:idmnar}
\begin{split}
\mathbb{P}(y\mid a) &=\! \sum_{m\in\mathcal{M}}
\mathbb{P}(y, m \mid a) = 
\sum_{m\in\mathcal{M}}\frac{\mathbb{P}(y, m, O^Y = 1 \mid a)}{\mathbb{P}(O^Y = 1 \mid y, a)}
\\&=\! \sum_{m\in\mathcal{M}} (1+\mathrm{OR}_{y,a})\mathbb{P}(y, m, O^Y = 1 \mid a).
\end{split}
\end{equation}
% Finally, \( \mathbb{P}(y \mid a) \) is identifiable since:
% \[
% \mathbb{P}(y \mid a) = \sum\limits_{m \in \mathcal{M}} \mathbb{P}(y, m \mid a),
% \]
Finally, \( \mu_a=\mathbb{E}[Y_t\mid A_t=a] \) is identified as
\(
\mu_a = \int_{y \in \mathcal{Y}} y \mathbb{P}(y \mid a) \, dy.
\)

In the remainder of this section, 
we assume \( Y \) is discrete with \( |\mathcal{Y}| = L \), 
and the outcomes are normalized so that \( \sum_{y \in \mathcal{Y}} |y| = 1 \).
Define \( K=\vert\mathcal{M}\vert\), and \( \Theta_a = [\mathbb{P}(m, y, O^Y = 1\mid a)]_{K \times L} \).
Additionally, we assume that these matrices are not ill-conditioned.\footnote{A problem is considered ill-conditioned if small changes to the input can cause large changes in the output solution. Bounding the condition number ensures the problem is well-conditioned.}
% Additionally, the agent is assumed to know an upper bound on the infinity norm of the inverse of the matrix :

\begin{restatable}[Bounded condition number]{assumption}{asbounded}\label{ass:mnar_K_assumption}
For each arm \( a \in \mathcal{A} \), the condition number of \( \Theta_a \) is bounded by:
\[
    \kappa(\Theta_a) \leq C_a,
\]
where \( \kappa(\Theta_a) \) denotes the condition number of \( \Theta_a \) with respect to the $\infty$-norm, defined as 
\(
    \kappa(\Theta_a) = \lVert \Theta_a \rVert_\infty \lVert \Theta_a^{\dagger} \rVert_\infty,
\) with $\Theta_a^{\dagger}$ being the pseudo-inverse of $\Theta_a$.
\end{restatable}
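
To make the role of $\Theta_a$ concrete, the following sketch writes the integral equation above for discrete $Y$; the vectors $b_a$ and $\mathrm{OR}_a$ are notation introduced only for this illustration.
Setting $b_a=[\mathbb{P}(m, O^Y=0\mid a)]_{m\in\mathcal{M}}\in\mathbb{R}^{K}$ and $\mathrm{OR}_a=[\mathrm{OR}_{y,a}]_{y\in\mathcal{Y}}\in\mathbb{R}^{L}$, the equation reads
\[
    \Theta_a\,\mathrm{OR}_a=b_a,
    \qquad\text{so that}\qquad
    \mathrm{OR}_a=\Theta_a^{\dagger}b_a
\]
whenever $\Theta_a$ has full column rank, which is the finite-dimensional counterpart of \Cref{as:complete}.
As a toy instance with $K=L=2$ (the numbers are illustrative and not taken from our experiments), let
\[
    \Theta_a=\begin{bmatrix}0.20 & 0.08\\ 0.05 & 0.32\end{bmatrix},
    \qquad
    b_a=\begin{bmatrix}0.22\\ 0.13\end{bmatrix},
\]
with rows indexed by $m$ and columns by $y$.
Solving the $2\times 2$ system gives $\mathrm{OR}_a=(1,\,0.25)^\top$, and $\mathbb{P}(y\mid a)=(1+\mathrm{OR}_{y,a})\sum_{m\in\mathcal{M}}\mathbb{P}(m,y,O^Y=1\mid a)$ then yields $\mathbb{P}(y_1\mid a)=(1+1)(0.20+0.05)=0.5$ and $\mathbb{P}(y_2\mid a)=(1+0.25)(0.08+0.32)=0.5$.
In a plug-in version of this step, estimation errors in $\Theta_a$ and $b_a$ are propagated through $\Theta_a^{\dagger}$, which is one way to read the role of the bound on $\kappa(\Theta_a)$.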

We present Algorithm~\ref{alg:mnar_algorithm} for minimizing cumulative regret under the MNAR assumptions. 
The key intuition behind this algorithm is to construct an estimator based on Eq.~\eqref{eq:idmnar} and build upper confidence bounds under \Cref{ass:mnar_K_assumption}.
% In this setting, 
In order to present the regret bound of this algorithm, 
% Using \ref{ass:mnar_K_assumption}, we derive the following regret bound for Algorithm~\ref{alg:mnar_algorithm}. 
define \( p_{y, a} = \mathbb{P}(Y = y \mid A = a)\), and \(\gamma_{y, a} = \mathbb{P}(O^Y = 1 \mid Y = y, A = a)\).


\begin{restatable}{theorem}{theomnarupper} \label{theo:mnar_upper}
\text{(Regret guarantee for Alg.~\ref{alg:mnar_algorithm})}  Under Assumptions \ref{ass:mnar_assumption}, \ref{as:complete}, and \ref{ass:mnar_K_assumption},
for every \( \alpha > 1 \), there exists a constant $c$ such that the following regret bound holds for $T \ge c$:
\[
\mathbb{E}[R_T] = O\Big( \sqrt{\alpha T \log(T) \sum\limits_{a} S_a^2} \Big),
\]
with \( S_a \!=\! \max \{ \frac{L C_a}{\gamma_a \lVert \Theta_a \rVert_\infty },\frac{K}{\gamma_a \sqrt{\sum\limits_{y \in \mathcal{Y}} p_{y, a} \gamma_{y, a}}}
\} \) and \( \gamma_a \!=\! \min\limits_{y} \gamma_{y, a} \).
\end{restatable}

\begin{remark}
With \( \gamma_\mathrm{min} = \min\limits_{y, a} \gamma_{y, a} \) and \( \kappa_\mathrm{max} = \max\limits_{a} \frac{C_a}{\lVert \Theta_a \rVert_\infty } \), \Cref{theo:mnar_upper} implies the following bound:
\[
\mathbb{E}[R_T] = O\Big( \sqrt{\alpha T \log(T) N 
\max \Big\{ \frac{L \kappa_\mathrm{max}}{\gamma_\mathrm{min}},\frac{K}{
\gamma_{\mathrm{min}}^{3/2}}\Big\}^2} \Big).
\]

\end{remark}
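
The remark follows from \Cref{theo:mnar_upper} by a short bounding argument, sketched here for completeness.
Since $\gamma_a\geq\gamma_{\mathrm{min}}$, $\frac{C_a}{\lVert \Theta_a \rVert_\infty}\leq\kappa_{\mathrm{max}}$, and $\sum_{y\in\mathcal{Y}}p_{y,a}\gamma_{y,a}\geq\gamma_{\mathrm{min}}\sum_{y\in\mathcal{Y}}p_{y,a}=\gamma_{\mathrm{min}}$ for every arm $a$, we have
\[
    S_a\leq \max\Big\{\frac{L\kappa_{\mathrm{max}}}{\gamma_{\mathrm{min}}},\frac{K}{\gamma_{\mathrm{min}}^{3/2}}\Big\},
\]
so that $\sum_a S_a^2$ is at most $N$ times the square of the right-hand side, where $N$ denotes the number of arms.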



% The proofs for Theorem~\ref{theo:mnar_upper} is provided in Appendix~\ref{apx:proofs}. Also i
% In Appendix \textcolor{red}{todo}, we show that without \ref{ass:mnar_K_assumption}, the agent cannot establish a regret bound... \textcolor{red}{add to appendix.}




\section{MAB with Missing Outcome and Mediator}\label{sec:missingmediator}
So far we considered cases where the mediator was fully observable.
We now discuss how our results extend to scenarios involving missing data in both \( Y \) and \( M \).
% We categorize these
% We categorize our analysis into two parts: one for Missing at Random (MAR) and Missing Not at Random (MNAR). 
Here, we assume that the outcome is MAR, and discuss the cases where the mediator is MAR and MNAR separately.
For the case where both outcome and mediator are MNAR, refer to \Cref{apx:Missing M}.
% Despite the different mechanisms, both scenarios yield analogous regret bounds to our earlier results. 
We begin by outlining each scenario, providing identification schemes and estimators for \( \mu_a \). 
The corresponding algorithms, theoretical results, and proofs are postponed to Appendix~\ref{apx:Missing M}.
\input{figures/Missing_Mech_M_Y}
Throughout this section, we work under \Cref{ass:mar_assumption}.

\subsection{MAR}
% In this scenario, the missingness of \( Y \) is independent of its value given the action \( A_t \). Figure~\ref{fig:m,y mar} illustrates the MAR mechanism, which adheres to Assumption~\ref{ass:mar_my_assumption}.
% We work under \ref{ass:mar_assumption}.
% However, s
Since the mediator values are missing, neither the conditional outcome means nor the probabilities $p_{m,a}$ are identifiable.
We require further structure to make progress.
% We propose two alternatives here.
One such structure is
% The first scenario arises
when the mediator missingness can be assumed to be at random, i.e., $\mm_t\indep(M_t,Y_t,\oo_t)\mid A_t$.
This assumption is valid, for instance, when the missingness mechanism for the mediator depends only on the action.
A less stringent alternative can be formalized as:
% implies a specific case of MAR (\ref{as:mar2}), where the mediator can be safely ignored and Alg.~\ref{alg:mcar_algorithm} is optimal.
% The other alternative is when the following holds.
\begin{assumption}\label{ass:mar_my_assumption} 
    \(
    \mm_t \indep M_t \mid A_t,
    \) and $\mm_t\indep Y_t\mid (A_t,M_t,\oo_t)$ for all $t\in\{1,\dots,T\} $.
\end{assumption}
See Fig.~\ref{fig:m,y mar} for a graphical representation satisfying Assumptions \ref{ass:mar_assumption} and \ref{ass:mar_my_assumption}.
Under these two assumptions, analogous to Eq.~\eqref{eq:idmar}, $\mu_a$ can be identified as follows:
\[
    \begin{split}
        &\mu_a
        = \mathbb{E}\big[  \mathbb{E}[Y_t \mid M, a, \oo_t = 1]\mid A_t=a\big]\\
        &= \mathbb{E}\big[  \mathbb{E}[Y^o_t \mid M, a, \oo_t = 1,\mm_t=1]\mid A_t=a,\mm_t=1\big],
        \end{split}
\]
where the second equality is due to \Cref{ass:mar_my_assumption}.
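
A plug-in estimator suggested by this identity can be sketched as follows; the index sets and hats are notation introduced only for this illustration, and the estimator analyzed in Appendix~\ref{apx:Missing M} may differ in its details.
Let $\mathcal{T}_a=\{t: A_t=a,\ \mm_t=1\}$ collect the rounds in which arm $a$ was pulled and the mediator was observed, and let $\mathcal{T}_{m,a}=\{t\in\mathcal{T}_a: M_t=m,\ \oo_t=1\}$, assumed non-empty.
Then
\[
    \hat{\mu}_a=\sum_{m\in\mathcal{M}}\hat{p}_{m,a}\,\hat{\mu}_{m,a},
    \qquad
    \hat{p}_{m,a}=\frac{\vert\{t\in\mathcal{T}_a: M_t=m\}\vert}{\vert\mathcal{T}_a\vert},
    \qquad
    \hat{\mu}_{m,a}=\frac{1}{\vert\mathcal{T}_{m,a}\vert}\sum_{t\in\mathcal{T}_{m,a}}Y^o_t,
\]
where $\hat{p}_{m,a}$ targets $\mathbb{P}(M_t=m\mid A_t=a,\mm_t=1)$, which equals $p_{m,a}$ under the first part of \Cref{ass:mar_my_assumption}.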
% where \( n_a \) is the number of times arm \( a \) was pulled and the reward was observed, and \( Y_a(1), \dots, Y_a(n_a) \) are the observed values of \( Y \) corresponding to those pulls. This approach closely resembles the process illustrated in Figure~\ref{fig:semi_mcar}.

\subsection{MNAR}
When the mediator is missing not at random, stronger assumptions are necessary to identify the expected rewards. 
Analogous to Section \ref{sec:mnar}, we will use a completeness assumption.
Here too, the identification strategy follows the approach of \citet{zuo2024mediation}.
% , we make the following two assumption.
% We make Assumption \ref{}
% In this scenario, the missingness of \( Y \) is independent of its value given \( (M_t, A_t) \). Also, we will have \( Y, O^Y, O^M \) are mutually independent given \( (M_t, A_t) \). Figure~\ref{fig:m,y mnar} illustrates the MNAR (i) mechanism, which follows Assumption~\ref{ass:mnar_my_assumption}.
\begin{assumption}\label{ass:mnar_my_assumption} 
    $Y_t$, $\oo_t$ and $\mm_t$ are mutually independent conditioned on $(A_t,M_t)$ for all $t\in\{1,\dots,T\}$.
    % \begin{enumerate}
    %     \item \( Y_t \indep O^Y_t \mid (A_t, M_t) \),
    %     \item \( Y_t \indep O^M_t \mid (A_t, M_t) \),
    %     \item \( O^Y_t \indep O^M_t \mid (A_t, M_t) \).
    % \end{enumerate}
\end{assumption}
\begin{assumption}\label{as:complete2}
    For every $a\in\mathcal{A},m\in\mathcal{M}$, $\mathbb{P}(M = m, Y, O^M = 1, O^Y = 1 \mid a)$ is complete in $Y$.
    That is, for any function $g:\mathcal{Y}\to \mathbb{R}$,
    \[\int_{\mathcal{Y}}\mathbb{P}(M = m, Y=y, O^M = 1, O^Y = 1 \mid a) g(y)dy=0\]
    implies $g(Y)=0$ with probability one.
\end{assumption}
Under Assumption~\ref{ass:mnar_my_assumption}, \(\mu_a\) can be expressed as:
\begin{align*}
    \mu_a &
    = \sum_{m \in \mathcal{M}}  \mathbb{E}[Y \mid a, m]p_{m,a} \\
    &= \sum_{m \in \mathcal{M}} \mathbb{E}[Y \mid a, m, O^Y = 1, O^M = 1]p_{m,a}.
\end{align*}

To proceed, we need to identify \( p_{m,a}=\mathbb{P}(M = m \mid A = a) \). 
This is achieved through \Cref{as:complete2}:
% when \( \mathbb{P}(M = m, Y = y, O^M = 1, O^Y = 1 \mid a) \) is complete in \( \mathbb{Y} \) for every \( a \) and \( m \).
% We derive the following:
\begin{align*}
    &\mathbb{P}(Y = y, O^M = 0, O^Y = 1 \mid a) \\
    &= \sum_{m \in \mathcal{M}} \mathbb{P}(M = m, Y = y, O^M = 0, O^Y = 1 \mid a) \\
    &= \sum_{m \in \mathcal{M}} \mathbb{P}(M = m, Y = y, O^M = 1, O^Y = 1 \mid a) \\
    &\quad \times \frac{\mathbb{P}(O^M = 0 \mid M = m, A = a)}{\mathbb{P}(O^M = 1 \mid M = m, A = a)},
\end{align*}
where we used \Cref{ass:mnar_my_assumption} in the last equality.
By \Cref{as:complete2}, the inverse odds ratios $\mathrm{OR}_{m,a}=\frac{\mathbb{P}(O^M = 0 \mid m, a)}{\mathbb{P}(O^M = 1 \mid m, a)}$ can be uniquely determined.
Finally,
% By leveraging the completeness of \( \mathbb{P}(M = m, Y = y, O^M = 1, O^Y = 1 \mid a) \), we can identify \( \mathbb{P}(O^M = 1 \mid M = m, A = a) \). Hence, we obtain:
\[
\begin{split}
    p_{m,a} &= \frac{\mathbb{P}(M = m, O^M = 1 \mid A = a)}{\mathbb{P}(O^M = 1 \mid A = a, M = m)}\\
    &=(1+\mathrm{OR}_{m,a})\mathbb{P}(M = m, O^M = 1 \mid A = a).
\end{split}
\] 
We use a two-step estimation process, whereby in the first step, $\hat{p}_{m,a}$ is estimated, and in the second step, the expected reward is estimated as
% The estimation process begins by identifying \( p_{m, a} = \mathbb{P}(M = m \mid A = a) \) using above formula. Once \( p_{m, a} \) is identified, \( \mu_a \) can be estimated as:
\[
\hat{\mu}_a = \sum_{m \in \mathcal{M}} \hat{p}_{m, a} \hat{\mu}_{m, a},
\]
where \( \hat{\mu}_{m, a} \) is the empirical mean of the samples \( Y_t \), obtained after pulling arm \( a \), conditioned on \( M_t=m \) with both \( O^M = 1 \) and \( O^Y = 1 \).
Here, we require \( Y_t \) to be finite-valued, analogous to Section \ref{sec:missingoutcome}.
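
When $Y$ is finite-valued, the first step admits a matrix form, sketched here under notation introduced only for this illustration.
Let $\Phi_a=[\mathbb{P}(M=m, Y=y, O^M=1, O^Y=1\mid a)]_{L\times K}$, with rows indexed by $y$ and columns by $m$, and let $c_a=[\mathbb{P}(Y=y, O^M=0, O^Y=1\mid a)]_{y\in\mathcal{Y}}$.
The identity relating the two missingness patterns above then reads
\[
    \Phi_a\,[\mathrm{OR}_{m,a}]_{m\in\mathcal{M}}=c_a,
\]
a system of $L$ linear equations in $K$ unknowns, which pins down the ratios $\mathrm{OR}_{m,a}$ whenever $\Phi_a$ has full column rank.
Replacing the entries of $\Phi_a$ and $c_a$ by empirical frequencies gives estimates $\widehat{\mathrm{OR}}_{m,a}$, and one possible plug-in choice is $\hat{p}_{m,a}=(1+\widehat{\mathrm{OR}}_{m,a})\,\hat{\mathbb{P}}(M=m, O^M=1\mid A=a)$.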

\section{Summary of Assumptions and Results}
To facilitate comparison, the following table summarizes the core assumptions and regret bounds for the MCAR, MAR, and MNAR frameworks under the setting where the mediator variable, \( M \), is always observed.

\input{Tables/summary_table}

\section{Empirical Evaluation}
\label{sec:empirical-evaluation}

Here, we provide an empirical evaluation of our MAB algorithms across different missing data scenarios -- MCAR, MAR(i,ii), and MNAR. 
All our simulations were run on Google Colab\footnote{\href{https://colab.google}{https://colab.google}} with Intel Xeon CPUs. 
Python implementations for reproducing the results of this paper are available on GitHub\footnote{\href{https://github.com/ilia-mahrooghi/Multi-armed-Bandits-with-Missing-Outcome}{https://github.com/ilia-mahrooghi/Multi-armed-Bandits-with-Missing-Outcome}}.

We model the MAB environment in all the aforementioned settings with \( n = 10 \) arms.
More comprehensive simulation results are provided in Appendix \ref{appendix:additional_simulations}.
% The goal of these experiments is to demonstrate that the algorithms' performance in practice aligns with the theoretical results established earlier. By analyzing the cumulative regret under each missing data mechanism, we aim to show how the nature of missing data affects the algorithm's ability to make optimal decisions in environments with incomplete reward information.

\subsection{Experiment Setup of MCAR}
Each arm \( a \in \{1, \dots, n\} \) has an associated mean reward \( \mu_a \), sampled independently from a uniform distribution over the interval \([0, 1]\).
% , i.e., \( \mu_a \sim \text{Uniform}(0, 1) \). 
The observation probability \( \gamma \)
% , which is the probability that a reward is observed after pulling an arm, 
is randomly drawn from a uniform distribution over \([0.5, 1.0]\).
% , such that \( \gamma \sim \text{Uniform}(0.5, 1.0) \). This probability remains constant for all arms and defines the likelihood that a reward is observed after pulling any arm.
At each time \( t \), when arm \( a \) is pulled, the reward \( Y_t \) is generated from a normal distribution \( \mathcal{N}(\mu_a, 1) \).
% , where \( \mu_i \) is the true mean of the selected arm and the variance is fixed at 1. 
%The observed reward is available with probability \( \gamma \); otherwise, the reward is not observed (i.e., missing).
Algorithm~\ref{alg:mcar_algorithm}'s performance is reported across 20 independent runs in the MCAR environment over a time horizon of \( T = 10{,}000 \) iterations, with a fixed parameter \( \alpha = 2 \). Figure~\ref{fig:MCAR-1} depicts the cumulative regret for different \( \gamma \) values. As expected, when \( \gamma \) decreases, the regret grows more rapidly as a consequence of lower observation likelihood.
% \textcolor{red}{TODO: refer to figure, mention the effect of $\gamma$}

% \begin{figure}[h]
% \vspace{-.1in}
% \centerline{\includegraphics[width=0.45\textwidth]{figures/MCAR-1.png}}
% \vspace{-.1in}
% \caption{Cumulative regret of the MCAR algorithm tested on the MCAR bandit environment with various \( \gamma \) values.}
% \end{figure}

\begin{figure*}[t]
    \centering
    \begin{subfigure}[b]{0.31\textwidth}
        \includegraphics[width=\textwidth]{figures/MCAR-1.png}
        \caption{MCAR algorithm on MCAR bandit with various \( \gamma \) values.}
        \label{fig:MCAR-1}
    \end{subfigure}
    \hfill
    \begin{subfigure}[b]{0.31\textwidth}
        \includegraphics[width=\textwidth]{figures/MAR-3.png}
        \caption{MAR algorithms with known and unknown $p_{m,a}$ for comparison.}
        \label{fig:MAR-1}
    \end{subfigure}
    \hfill
    % \vspace{0.3cm} % Adds vertical space between rows
    \begin{subfigure}[b]{0.31\textwidth}
        \includegraphics[width=\textwidth]{figures/MAR-1.png}
        \caption{MAR algorithm with different $p_{m,a}$ initializations on MAR environment.}
        \label{fig:MAR-2}
    \end{subfigure}
    % \hfill
    \begin{subfigure}[b]{0.31\textwidth}
        \includegraphics[width=\textwidth]{figures/MAR-2.png}
        \caption{MAR and UCB algorithms in the MAR bandit environment.}
        \label{fig:MAR-3}
    \end{subfigure}
    \hspace{1cm}
    \begin{subfigure}[b]{0.31\textwidth}
        \includegraphics[width=\textwidth]{figures/MNAR-1.png}
        \caption{Performance of the MNAR algorithm in the described environment.}
        \label{fig:MNAR-1}
    \end{subfigure}
    \hfill
    \caption{Results corresponding to MCAR, MAR, and MNAR settings. The shaded regions represent the error bars, showing one standard deviation across multiple runs of the simulations.}
    \label{fig:combined}
\end{figure*}

\subsection{Experiment Setup of MAR}

The MAB environment is modeled with \( n = 10 \) arms and \( K = 5 \) possible mediator values.
The expected reward for all arms is determined by \( \{\mu_{m,a}\}_{m,a} \in \mathbb{R}^{n \times K} \), where \( \mu_{m,a} \) represents the mean reward for arm \( a \) when the mediator takes value \( m\). 
This reward matrix is chosen by sampling each \( \mu_{m,a} \) independently from a uniform distribution over \([0, 0.4]\). 
To ensure the first arm is the optimal one, an additional 0.6 is added to its corresponding mean.
The observation mechanism is defined by a matrix \(\{\gamma_{m,a}\}_{m,a} \in \mathbb{R}^{n \times K} \), where each \( \gamma_{m,a} \) is sampled independently from a uniform distribution over \([0.8, 1]\).
% , so \( \gamma_{i,j} \sim \text{Uniform}(0.8, 1.0) \). The value \( \gamma_{i,j} \) represents the probability that the reward for arm \( i \), mediator \( j \), will be observed.

For each arm \( a \), a categorical probability distribution \( \{p_{m,a}\}_{m} \in \mathbb{R}^{K} \) is defined over the \( K \) values of $M$.
% , where \( \sum_{j=1}^{k} p_{i,j} = 1 \). 
This distribution is drawn from a Dirichlet distribution, i.e., \( \{p_{m,a}\}_{m} \sim \text{Dirichlet}(\mathbf{1}_K) \). 
% When an arm \( i \) is selected, mediator's value \( m \) is set according to the probability measure \( \mathbf{p}_i \).
Upon pulling arm $a$ and the mediator taking value $m$,  reward \( Y_t \) is drawn from a normal distribution \( \mathcal{N}(\mu_{m,a}, 1) \), where \( \mu_{m,a} \) is the mean reward for arm \( a \) when mediator takes value \( m \). 
The reward is observed with probability \( \gamma_{m,a} \).
% and if it is not observed, it is treated as missing data.
We ran Algorithms~\ref{alg:mar_algorithm2} and~\ref{alg:mnar_algorithm} over a time horizon of \( T = 100{,}000 \). Their cumulative regret was averaged across 10 independent runs.  
As shown in Fig.~\ref{fig:MAR-1}, knowing the conditional probabilities $p_{m,a}$ in advance improves the cumulative regret, as expected.
% \begin{figure}[h]
% \vspace{-.1in}
% \centerline{\includegraphics[width=0.45\textwidth]{figures/MAR-3.png}}
% \vspace{-.1in}
% \caption{Sample Figure Caption}
% \end{figure}
% In the other simulations the reward means \( \mu_{m,a} \) were sampled from \([0, 0.8]\), with an additional bias of \( +0.2 \) applied to the first arm. Observation probabilities \( \gamma_{m,a} \) were drawn from \([0.2, 1.0]\).
% 
% \begin{itemize}
%     \item \textbf{Uniform Distribution:} , \( p_{i,j} = \frac{1}{k} \).
%     \item \textbf{Peaked Distribution:} One mediator per arm has a higher probability, using a Dirichlet distribution biased by \( \alpha = 5 \) for the chosen mediator.
% \end{itemize}
% Each strategy was run for \( T = 100,000 \) iterations, repeated 20 times. We recorded the cumulative regret for each run and averaged the results.

Fig.~\ref{fig:MAR-2} demonstrates the average cumulative regret of the MAR algorithm with different probability distributions over the mediator.
% does not differ significantly. However, t
In particular, 
two mediator value selection strategies were tested:
(i) uniform, where each mediator value has an equal probability, and (ii) a peaked distribution, where one mediator value per arm has a higher probability, using a Dirichlet distribution biased by \( \alpha = 5 \) for the chosen mediator value.
The peaked distribution results in a higher cumulative regret, which aligns with the result from Theorem~\ref{theo:mar_lower_1}, since \( S \) is maximized when the probability distribution is concentrated on the largest \( \gamma_{m, a} \).


% \begin{figure}[h]
% \vspace{-.1in}
% \centerline{\includegraphics[width=0.45\textwidth]{figures/MAR-1.png}}
% \vspace{-.1in}
% \caption{MAR algorithm with different P initializations on MAR environment.}
% \end{figure}

In Figure~\ref{fig:MAR-3}, we compare the performance of the UCB and MAR algorithms in the MAR bandit environment. The results illustrate that the cumulative regret of the UCB algorithm is consistently higher than that of the MAR algorithm. Notably, the regret of the UCB algorithm exhibits a near-linear growth as a result of the bias in its estimation of the reward. This bias is due to the failure to account for the mediator structure. In contrast, the MAR algorithm, which explicitly utilizes mediators to handle missingness, achieves accurate reward estimation and a significantly lower regret.
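
The source of this bias can be illustrated with a small calculation, under the assumption that the baseline averages the observed rewards of each arm without using the mediator; the numbers below are illustrative and not taken from our experiments.
Such an average converges to
\[
    \mathbb{E}[Y_t\mid A_t=a,\oo_t=1]
    =\frac{\sum_{m\in\mathcal{M}}p_{m,a}\gamma_{m,a}\mu_{m,a}}{\sum_{m\in\mathcal{M}}p_{m,a}\gamma_{m,a}},
\]
which generally differs from $\mu_a=\sum_{m\in\mathcal{M}}p_{m,a}\mu_{m,a}$.
For instance, with two mediator values, $(p_{1,a},p_{2,a})=(0.5,0.5)$, $(\gamma_{1,a},\gamma_{2,a})=(1,0.5)$, and $(\mu_{1,a},\mu_{2,a})=(0,1)$, the true mean is $\mu_a=0.5$, whereas the naive limit is $\frac{0.25}{0.75}=\frac{1}{3}$.
A competing arm with true mean between $\frac{1}{3}$ and $\frac{1}{2}$ and no such distortion would then be ranked above arm $a$ in the limit, which is consistent with near-linear regret.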

% \begin{figure}[h]
% \vspace{-.1in}
% \centerline{\includegraphics[width=0.45\textwidth]{figures/MAR-2.png}}
% \vspace{-.1in}
% \caption{ggggg.}
% \end{figure}

\subsection{Experiment Setup of MNAR}
The MNAR algorithm was evaluated in an environment with \( n = 10 \) arms, \( K = 5 \) mediator values, and \( \vert\mathcal{Y}\vert = 5 \) possible outcomes, over a horizon of \( T = 100{,}000 \), repeated 10 times. For each arm \( a \) and mediator value \( m \), the reward followed a categorical distribution sampled from a Dirichlet distribution, except for one arm, whose reward distribution was sampled from a biased Dirichlet distribution. The bias was applied to the largest \( y \in \mathcal{Y} \), ensuring that this arm had a higher expected reward. 
The observation probabilities \( \gamma_{y,a} \) were drawn from a uniform distribution over \([0.5, 1.0]\), while the mediator probabilities were sampled from a Dirichlet distribution.
% To ensure more interpretable results, the values of \( \mathcal{Y} \) were set as \( [300, 1000, 400, 300, 200] \), but for the cumulative regret calculations, they were normalized to \( [0, 1] \). Moreover, the values of \( C_a \) from Assumption~\ref{ass:mnar_K_assumption} were set to the condition numbers.
Fig.~\ref{fig:MNAR-1} shows that the algorithm successfully adapts to the MNAR setup, effectively minimizing the cumulative regret. 

% \section{Model Selection for the Missingness Mechanism}


% ns where assumptions about the environment cannot be fully guaranteed.

\section{Limitations and Concluding Remarks}\label{sec:conc}
We studied multi-armed bandits with missing outcomes and adapted UCB algorithms to incorporate missingness.
Our approaches extend the applicability of MAB algorithms to a wider range of real-world online decision-making problems.
We expect that the insights given by this paper will help researchers to develop and adapt other existing decision-making algorithms to take missingness into account.
We assumed that the auxiliary (mediator) $M$ takes values in a finite set.
Parametric (or semiparametric) models can be adopted to relax this assumption in the future.
We further acknowledge that estimating the odds ratios through integral equations in the MNAR setting presents significant challenges, both in terms of computational complexity and sample efficiency. Hence, we have postponed the problem of MAB with continuous outcomes missing not at random to future work.

A practical challenge arises when choosing between several plausible settings for the missingness mechanism. To address this model selection problem, dynamic balancing offers a principled solution \citep{cutkosky2021dynamic}. The technique involves running an instance of our algorithm for each candidate setting and using a meta-algorithm to dynamically arbitrate between them based on performance.
This approach is analogous to recent methods for handling model uncertainty in causal bandits \citep{liu2024causal}. Its inclusion extends our framework to a more robust version capable of automatically selecting the most appropriate model, enhancing its practical applicability.

\section*{Acknowledgments}
This research was in part supported by the Swiss National Science Foundation under NCCR Automation II, grant agreement 51NF40$\_$225155.

\bibliography{bibliography}