\section{Introduction}
Causal Bayesian Networks (CBNs) \citep{Pearl00} have become a natural choice for modelling causal relationships in many real-world settings, such as price discovery \citep{Haigh2004}, computational advertising \citep{Bottou2013}, and healthcare \citep{Velikova2014}.
A CBN has two components: a directed acyclic graph (DAG) called the causal graph, and conditional probability distributions of each node given its parents, such that the joint distribution of all variables factorizes as a product of these conditionals. Edges in the causal graph represent direct causal relationships, and the graph therefore captures the data-generating process.
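As an illustration of this factorization, write $V_1,\dots,V_n$ for the variables of the CBN and $\mathrm{Pa}(V_i)$ for the parents of $V_i$ in the causal graph (this notation is assumed here only for the sketch and is not the notation fixed later in the paper). The joint distribution then takes the form
\[
  \mathbb{P}(V_1,\dots,V_n) \;=\; \prod_{i=1}^{n} \mathbb{P}\bigl(V_i \mid \mathrm{Pa}(V_i)\bigr).
\]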

In its most general setup, only a subset of the variables appearing in the CBN are observable and the rest are unobserved (see Definition $1.3.1$ in \cite{Pearl00}). 
CBNs enable modelers to simulate the effect of external manipulations via a process called \emph{intervention}. 
An intervention forcibly fixes selected observable variables in the graph and breaks the edges coming into them.
Data generated from the resulting model is the simulated outcome of the intervention. In the presence of an outcome variable of interest $Y$ (assumed to be observable), a natural question (see the epidemic prevention example below) is to find the variable $X$ and a corresponding value $x$ such that the intervention setting $X$ to $x$ leads to the maximum expected value of $Y$, i.e., $X=x$ has the highest causal impact on $Y$. An intervention that manipulates only a single variable is called an \emph{atomic intervention}.
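As a sketch of this objective in symbols, let $\mathbf{X}$ denote the set of variables that may be intervened upon and let $\mathrm{dom}(X)$ denote the set of values that $X$ can take (both names are assumptions made for this illustration). Using the $do(\cdot)$ notation of \cite{Pearl00}, a best atomic intervention can be written as
\[
  (X^{*}, x^{*}) \;\in\; \arg\max_{X \in \mathbf{X},\; x \in \mathrm{dom}(X)} \mathbb{E}\bigl[\,Y \mid do(X = x)\,\bigr].
\]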

The problem of learning the best atomic intervention was formulated as a sequential decision making problem called \emph{Causal Bandits} (\CB) in \citet{LattimoreLR16}. In \CB, access to the underlying DAG of the CBN is assumed but the associated conditional probability distributions are unknown. The outcome variable $Y$ is considered as a reward variable and the set of allowed atomic interventions is modelled as the arms of a bandit instance. In addition, there is an \emph{observational arm} corresponding to the empty intervention, and pulling the observational arm generates a sample from the joint distribution of all observable variables. Here, identifying the best atomic intervention is equivalent to the well-studied \emph{best-arm identification} problem in a multi-armed bandit (\MAB) instance. However, in \CB, an algorithm pulling an arm also has access to \emph{causal} side information derived from the causal graph associated with the input CBN. See \cite{LattimoreLR16} and the references therein for a comparison of \CB\ and \MAB\ problems with other types of side-information.

In this work, we study \CB\ for causal graphs with unobserved confounders (UCs). These are unobserved variables that are parents of at least two observable variables. To the best of our knowledge, this is the first work that analyses the regret of causal bandit algorithms when input causal graphs contain UCs. Moreover, in the fully observable setting, i.e., when all variables are observable, our algorithm does not assume any structural constraints on the input causal graph and hence can be applied quite generally.
Before stating our contributions, we provide a motivating example where determining the best atomic intervention is important.
\begin{figure}
     \centering
     \includegraphics[scale=0.6]{figures/covid-CG.png}
     \caption{Causal graph for epidemic prevention, with social and economic factors as an unobserved confounder.}
     \label{fig:exampleGraph}
\end{figure}
Suppose a policy-maker is required to identify the best precautionary measure to enforce in order to reduce the spread of a disease. The available measures are mandating social distancing, the wearing of face masks, working from home, and preventive vaccination. Since the effect of each measure needs to be isolated while disrupting public life minimally, the policy-maker can enforce at most one of these measures at a given time.
%To isolate the effect of each measure and to avoid disrupting public life, they would want to enforce at most one of these at a given time. 
The policy-maker can conduct surveys to collect data from the public about which measures they took (other than the one enforced) and whether or not they got infected. The goal is then to design a mechanism that enforces the measures one at a time over a given period and collects the corresponding survey data in order to identify the best measure to enforce. Note that, using the domain knowledge of health experts, policy-makers can have access to an underlying causal graph such as the one in Fig. \ref{fig:exampleGraph}. They would want to use this graph to decide if and when a particular measure should be enforced.


\subsection{Our Contributions}
We study \CB\ with respect to two standard objectives in \MAB: simple and cumulative regret. Simple regret captures the best-arm identification problem described above, whereas cumulative regret is more natural when the goal is to maximize the cumulative reward at the end of $T$ rounds instead of determining the best arm. We state our contributions below; the relevant terminology is defined in Sec. \ref{sec: model and prelim}.
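As a rough guide to the two objectives, the following sketch uses standard bandit notation that is assumed here for illustration only; the formal definitions in Sec. \ref{sec: model and prelim} take precedence. Let $\mu_a$ be the expected reward of arm $a$, let $\mu^{*} = \max_a \mu_a$, let $a_t$ be the arm pulled in round $t$, and let $\hat{a}_T$ be the arm returned after $T$ rounds. The expected simple regret $r_T$ and the expected cumulative regret $R_T$ are then commonly written as
\[
  r_T \;=\; \mu^{*} - \mathbb{E}\bigl[\mu_{\hat{a}_T}\bigr],
  \qquad
  R_T \;=\; T\mu^{*} - \mathbb{E}\Bigl[\,\sum_{t=1}^{T} \mu_{a_t}\Bigr].
\]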

\textbf{Simple Regret Minimization}: We propose a simple regret minimization algorithm called \SRM.
The input causal graph of the underlying CBN $\mathcal{C}$ is assumed to be (without loss of generality) a Semi-Markovian Causal Graph or SMCG (defined in Sec. \ref{sec: model and prelim}) on observable nodes, having both directed and bi-directed edges (the latter representing the presence of UCs) between the nodes. We assume that the input SMCG is identifiable with respect to a set of intervenable nodes $\mathbf{X}$, meaning that the interventional distributions arising from atomic interventions on variables in $\mathbf{X}$ can be consistently estimated from the observational data itself (see the definition in Sec. \ref{sec: model and prelim}). When the c-components (connected components of bi-directed edges, defined in Sec. \ref{sec: model and prelim}) of the SMCG are bounded in size (by a constant) and the total in-degree of all vertices in the c-components is also bounded, given a time budget of $T$ rounds, \SRM\ attains $\tilde{O}(\sqrt{m(\mathcal{C})/T})$ expected simple regret (see Thm. \ref{theorem: UB-SR}). Here, $m(\mathcal{C})$ depends on the input CBN and is $\leq 2N$, where $N$ is the number of intervenable nodes, i.e., $N = |\mathbf{X}|$.

In Sec. \ref{sec: simple regret for general graphs} we give examples of graphs where $m(\mathcal{C}) \ll N$, and hence \SRM\ performs better than standard bandit algorithms, which achieve $\Omega(\sqrt{N/T})$ expected simple regret in the worst case (Thm. 4 in \cite{AudibertBM10}).
\SRM\ leverages the causal side-information available by deriving reward estimates for each arm from pulls of the observational arm. The quality of these derived estimates depends on the input CBN. The quantity $m(\mathcal{C})$ intuitively captures the trade-off between the number of arms with bad estimates, and the quality of estimates determined from intervening upon them explicitly.

We note that \citet{LattimoreLR16} and \citet{NairPS21} propose algorithms in the fully observable setting, when the input causal graph is a parallel graph and a no-backdoor graph, respectively.\footnote{These graphs have no backdoor paths from any $X\in \mathbf{X}$ to $Y$, implying $\mathbb{P}(Y\mid do(x)) = \mathbb{P}(Y\mid x)$.} 
For these special classes of graphs, \SRM\ recovers the regret guarantees given in \cite{LattimoreLR16,NairPS21}. Hence, \SRM\ can be viewed as a \emph{significant} generalization of these algorithms to more general causal graphs with UCs. Further, \citet{YabeHSIKFK18} proposed a causal bandit algorithm for interventions which can simultaneously manipulate multiple variables. However, the input causal graph is assumed to have no UCs, and the regret guarantee of their algorithm is $\tilde{O}(\sqrt{N/T})$ for atomic interventions. In particular, its performance is not better than that of optimal \MAB\ algorithms that do not take causal side-information into account.
In Sec. \ref{sec: experiments}, we experimentally compare the regret of \SRM\ with that of the algorithm in \cite{YabeHSIKFK18}, as well as with \MAB\ algorithms that do not take causal side-information into account.
 
\textbf{Lower Bound on Simple Regret}: We also show that \SRM\ is almost optimal for CBNs associated with a large and important class of causal graphs. Specifically, in Thm. \ref{theorem: LB-Tree}, we show that for any causal graph $\mathcal{G}$ which is an $n$-ary tree on $N+1$ nodes\footnote{$N$ nodes are intervenable and can causally affect node $Y$.}, and any $M\in [1,N]$, there is a probability distribution $\mathbb{P}$ compatible with the causal graph such that $m(\mathcal{C}) = M$, and the expected simple regret of any algorithm at the end of $T$ rounds is $\Omega(\sqrt{M/T})$\footnote{Here $\mathcal{C}$ is the CBN $(\mathcal{G}, \mathbb{P})$ and $m(\mathcal{C})$ is as described before.}. We remark that these graphs naturally capture important CBNs like causal trees \citep{GreenewaldKSMKA19}. Also, the class of graphs considered in Thm. \ref{theorem: LB-Tree} subsumes the parallel graph model, and for them the lower bound in Thm. \ref{theorem: LB-Tree} matches the lower bound given in \cite{LattimoreLR16}. Importantly, Thm. \ref{theorem: LB-Tree} implies that the regret guarantee of \SRM\ can only be improved by considering more nuanced structural restrictions on the causal graph, which could enable more causal information sharing between the interventions.

\textbf{Cumulative Regret Minimization}: We propose a cumulative regret minimization algorithm called \CRM. All variables in the input causal graph are assumed to be observable. \CRM\ achieves constant expected cumulative regret if the observational arm is optimal, and otherwise achieves better regret than the optimal \MAB\ algorithm which does not take causal side-information into account (see Thm. \ref{theorem: UB-CRM}). Cumulative regret minimization in general graphs was also studied by \citet{LU2020} and \citet{NairPS21}. However, they crucially assume that the distribution of the parents of the reward node is known for every intervention. \CRM\ does not make this assumption. The reason why we develop \CRM\ in the fully observable setting (unlike \SRM) is rather technical and is explained at the end of Sec. \ref{sec: cumulative regret}.

\subsection{Related Work} 
As noted before, the causal bandit problem was introduced in \cite{LattimoreLR16}, where an almost optimal algorithm was proposed for CBNs associated with a parallel causal graph. Recently, a similar algorithm for simple regret minimization along with an algorithm for cumulative regret minimization was proposed for no-backdoor graphs in \cite{NairPS21}, and the observation-intervention trade-off was studied when interventions are costlier than observations. An importance sampling based algorithm was proposed by \citet{SenSDS17} to minimize simple regret, but only soft interventions at a single node were considered. The cumulative regret minimization problem for general causal graphs was studied in \cite{LU2020,NairPS21}, but they assume knowledge of the distributions of the parents of the reward variable for every intervention.
Recently, \cite{Lu2021} designed a cumulative regret minimization algorithm which only utilizes the side information that the underlying causal graph is a directed tree or a causal forest (and does not require the exact DAG). Assuming faithfulness and identifiability, their algorithm outperforms standard \MAB\ algorithms.
\citet{SenSKDS17} studied the contextual bandit problem where the observed context influences the reward via a latent confounder variable, and proposed an algorithm with a better guarantee than standard contextual bandit algorithms. \cite{LeeB18,LeeB19} gave a procedure to compute the minimum possible intervention set by removing sub-optimal interventions identifiable from the input causal graph, and they empirically demonstrated that ignoring such information leads to huge regret. Recently, \cite{lu2021causal} introduced causal Markov decision processes, where at each state a causal graph determines the action set, and gave algorithms that achieve better policy regret when the causal side-information is taken into account.
Finally, in a related line of work, \citet{BareinboimFP15} promote the use of observational data for bandit problems in the presence of UCs. We note that our proposed algorithms \SRM\ and \CRM\ both use observational samples to leverage side-information and hence achieve better regret.