\section{Cumulative Regret Minimization}\label{sec: cumulative regret}
\label{cumulative-regret}
In this section, we propose \CRM, an algorithm based on the well-known \UCB\ algorithm \citep{AuerCF02}, that sequentially performs (atomic) interventions and minimizes the cumulative regret incurred over the time horizon $T$. Unlike \SRM, here we assume that all nodes in the input graph $\mathcal{G}$ are observable and that the joint distribution $\mathbb{P}$ is strictly positive\footnote{Strict positivity of the joint distribution is often assumed in the causality literature \citep{Hauser2012}.}. Similar to the \UCB\ family of algorithms, \CRM\ maintains \UCB\ estimates at each round and pulls the arm with the highest \UCB\ estimate. \CRM\ performs better than the standard \UCB\ algorithm \citep{AuerCF02} by leveraging the available causal side-information via the \emph{backdoor criterion} \citep{PEARL2009}. In particular, to compute the \UCB\ estimate of $a_{i,x}$, \CRM\ uses the samples from the observational arm pulls in addition to the samples from the pulls of $a_{i,x}$.
Note that even though the observational arm may not be reward-optimal, pulling it yields causal side-information about all the arms simultaneously. \CRM\ ensures a good trade-off between this simultaneous exploration and the possible loss in reward by ensuring that $a_0$ is pulled at least a pre-specified (carefully chosen) number of times. We note that \texttt{CRM-NB-ALG}, proposed for no-backdoor graphs in \cite{NairPS21}, also ensures that the observational arm $a_0$ is pulled a pre-specified number of times, but \CRM\ differs from \texttt{CRM-NB-ALG} in how the \UCB\ estimates for the arms are computed at the end of each round. Next, we present the details of \CRM.

\begin{algorithm}
\caption{\CRM\ (Minimizing cumulative regret in general causal graphs)} \label{CR-algorithm}
\begin{algorithmic}
\State INPUT: Causal graph $\mathcal{G}$ and the set of intervenable nodes
\end{algorithmic}
\begin{algorithmic}[1]
\State Pull each arm once and set $t = 2N+2$
\State Let $\beta = 1$
\For {$t = 2N+2, 2N+3, \ldots$}
    \If {$N_{t-1}^0 < \beta^2 \log t$}
        \State Pull $a_t = a_0$
    \Else 
        \State Pull $a_t = \arg\max_{a \in A} \bar{\mu}_a(t-1)$
    \EndIf
    \State $N_t^a = N_{t-1}^a + \mathds{1}\{a_t = a\}$
    \vspace{3pt}
    \State Update $\widehat{\mu}_a(t)$ and $\bar{\mu}_a(t)$ for all $a \in A$ according to Equations \ref{equation: emprical estimate for arm 0 modified}, \ref{equation: emprical estimate for arm i,x modified}, \ref{equation: UCB estimate for arms} and \ref{equation: UCB estimate for arm 0}.
    \vspace{3pt}
    \State Let $\widehat{\mu}^* = \max_a \widehat{\mu}_a(t)$
    \vspace{3pt}
    \If {$\widehat{\mu}_0(t) < \widehat{\mu}^*$}
        \State Set $\beta = \min \{\frac{2\sqrt{2}}{\widehat{\mu}^* - \widehat{\mu}_0(t)}, \sqrt{\log t}\}$
    \EndIf    
    \State $t = t+1$
\EndFor
\end{algorithmic}
\end{algorithm}

We use $N_t^{i,x}$ and $N_t^0$ to denote the number of times arms $a_{i,x}$ and $a_0$, respectively, have been played by the end of round $t$, and let $a_t$ denote the arm pulled at round $t$. Also, $\widehat{\mu}_{i,x}(t)$ and $\bar{\mu}_{i,x}(t)$ (respectively $\widehat{\mu}_0(t)$ and $\bar{\mu}_0(t)$) denote the empirical and \UCB\ estimates of arm $a_{i,x}$ (respectively arm $a_0$) computed at the end of round $t$.
At Step 4, \CRM\ checks whether the observational arm has been pulled at least $\beta^2\log t$ times, and accordingly plays either the observational arm or the arm with the highest \UCB\ estimate. The value of $\beta$ is updated in Steps 11--12. As noted before, the chosen update for $\beta$ and the corresponding pre-specified number of pulls of arm $a_0$ delicately balance the exploration-exploitation trade-off in expectation.
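As a hedged numerical illustration (the values below are hypothetical, and we assume here that $\log$ denotes the natural logarithm), consider the threshold $\beta^2 \log t$. With the initial value $\beta = 1$ and $t = 100$, the threshold is $\ln 100 \approx 4.61$, so $a_0$ is pulled whenever it has been pulled at most $4$ times. If instead the empirical gap is $\widehat{\mu}^* - \widehat{\mu}_0(t) = 0.8$ at $t = 10^6$, then
\[
\beta = \min\Big\{\frac{2\sqrt{2}}{0.8},\ \sqrt{\ln 10^6}\Big\} \approx \min\{3.54,\ 3.72\} = 3.54, \qquad \beta^2 \ln t = \frac{8}{0.8^2}\,\ln 10^6 \approx 172.7~.
\]
Whenever the cap $\sqrt{\log t}$ does not bind, the threshold equals $8\log t/(\widehat{\mu}^* - \widehat{\mu}_0(t))^2$, which grows as the empirical gap shrinks.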
The empirical estimate for arm $a_0$ at Step 9 is computed as follows:
\begin{equation}\label{equation: emprical estimate for arm 0 modified}
    \widehat{\mu}_0(t) = \frac{1}{N_t^0}\sum_{s=1}^t \mathds{1}\{Y(s)=1, a_s=a_0\}~.
\end{equation}
The empirical estimate for arm $a_{i,x}$ is more involved and, as mentioned before, leverages the following backdoor criterion (see Thm. 3.3.2 in \cite{PEARL2009}):

    \begin{align}\label{eqn:backdoor-criterion}
        &\mathbb{P}(Y=1\mid do(X_i =x))=\\
        &\sum_{\mathbf{z}} \mathbb{P}(Y=1 \mid X_i=x,\mathbf{Pa}(X_i)=\mathbf{z})\,\mathbb{P}(\mathbf{Pa}(X_i)=\mathbf{z})~. \nonumber
    \end{align}
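For illustration only, suppose $X_i$ has a single binary parent $Z$ (a hypothetical instance chosen only for the arithmetic) with $\mathbb{P}(Z=0)=0.7$, $\mathbb{P}(Z=1)=0.3$, $\mathbb{P}(Y=1\mid X_i=x, Z=0)=0.2$ and $\mathbb{P}(Y=1\mid X_i=x, Z=1)=0.6$. The backdoor adjustment formula then gives
\[
\mathbb{P}(Y=1\mid do(X_i=x)) = 0.2\cdot 0.7 + 0.6 \cdot 0.3 = 0.32~,
\]
which in general differs from the observational conditional $\mathbb{P}(Y=1\mid X_i=x)$, since the latter weights the same two terms by $\mathbb{P}(Z=z\mid X_i=x)$ instead of $\mathbb{P}(Z=z)$.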

Let the set of time steps $s \leq t$ at which arm $a_0$ is pulled be denoted by $S_t = \{t_1, \ldots, t_{N_t^0}\}$. Partition $S_t$ into two parts: $O_t$, containing all the time steps with odd indices (i.e., $t_1, t_3$, etc.), and $E_t$, containing all the time steps with even indices (i.e., $t_2, t_4$, etc.). We now define some sets and intermediate estimators in order to describe the final estimator. Since $X_i$ is clear from the context, we do not index these intermediate estimators by $i$; in general, these sets and estimators differ across $i$. We use the time steps in $O_{t}$ to estimate $\mathbb{P}(Y=1 \mid X_i=x,\ \mathbf{Pa}(X_i)=\mathbf{z})$, and those in $E_t$ to estimate $\mathbb{P}(\mathbf{Pa}(X_i)=\mathbf{z})$. These probabilities are estimated on disjoint sets of time steps so that the estimators are independent of each other, which we require to show that the final estimator is unbiased (Lemma \ref{lemma: unbiased muix} in App. \ref{secappendix: proof of CRM}). To estimate the above-mentioned probabilities, we focus on the subsets
\[
O_t^{x,z} = \{s\in O_t \mid X_i(s) = x, \mathbf{Pa}(X_i)(s) = \mathbf{z}\} \subseteq O_t~.
\]

Let $C_t^x = \min_{\mathbf{z}} |O_t^{x,z}|$ be the minimum value of $|O_t^{x,z}|$ as $\mathbf{z}$ is varied. To use the time steps in $E_t$ for estimating $\mathbb{P}(\mathbf{Pa}(X_i)=\mathbf{z})$, we partition $E_t$ into $C_t^{x}$ parts\footnote{Each part has at least $\lfloor|E_t|/C_t^{x}\rfloor$ elements. The choice of $C_t^x$ helps in bounding the regret (Lemma \ref{lemma: concentration bounds on mu in cumulative regret} in App. \ref{secappendix: proof of CRM}).}, say $E_t = E_{t,1}\cup \ldots \cup E_{t, C_t^{x}}$. For each part $E_{t,c}$, $c\in [C_t^x]$, we create an estimator of the probability $\mathbb{P}(\mathbf{Pa}(X_i)=\mathbf{z})$ as follows:
\[
\widehat{p}_{t,c}^{\,\mathbf{z}} = \frac{1}{|E_{t,c}|}\sum_{s \in E_{t,c}} \mathds{1}\{\mathbf{Pa}(X_i)(s) = \mathbf{z}\}~.
\]
Now we are ready to build an estimator using Equation \ref{eqn:backdoor-criterion}. For each $\mathbf{z}$, let $s_1, \ldots, s_{C_t^x}$ be any distinct elements\footnote{They exist since $|O_t^{x,z}|\geq C_t^x$.} of the set $O_t^{x,z}$; these elements depend on $\mathbf{z}$, which we suppress in the notation. For each $c\in [C_t^x]$, we define a variable $Y_c^x$ as follows:
\[
Y_c^{x} = \sum_{\mathbf{z}} \mathds{1}\{Y(s_{c})=1\}\,\widehat{p}_{t,c}^{\,\mathbf{z}}~.
\]
Let $S_t^{i, x}$ be the set of time steps $s \leq t$ at which arm $a_{i,x}$ is pulled. Our final empirical estimator $\widehat{\mu}_{i,x}(t)$ of arm $a_{i,x}$ is:
\begin{equation}\label{equation: emprical estimate for arm i,x modified}
    \widehat{\mu}_{i,x}(t) = \frac{\sum_{s \in S_t^{i,x}}\mathds{1}\{Y(s)=1\} + \sum_{c \in [C_t^{x}]} Y_c^{x}}{N^{i,x}_t + C^{x}_t}
\end{equation}
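As a sketch with hypothetical numbers, suppose $\mathbf{Pa}(X_i)$ is a single binary variable, arm $a_{i,x}$ has been pulled $N_t^{i,x} = 3$ times with two rewards equal to $1$, and $C_t^x = 2$. If the first part of $E_t$ gives parent-probability estimates $0.6$ for $\mathbf{z}=0$ and $0.4$ for $\mathbf{z}=1$, and only the selected observational sample with $\mathbf{z}=1$ has reward $1$, then $Y_1^x = 0.4$. If the second part gives estimates $0.7$ and $0.3$, and only the selected sample with $\mathbf{z}=0$ has reward $1$, then $Y_2^x = 0.7$. Equation \ref{equation: emprical estimate for arm i,x modified} then evaluates to
\[
\widehat{\mu}_{i,x}(t) = \frac{2 + (0.4 + 0.7)}{3 + 2} = 0.62~.
\]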

It is easy to see that $\mathbb{E}[\widehat{\mu}_0(t)] = \mu_0$, and in Lemma \ref{lemma: unbiased muix} (App. \ref{secappendix: proof of CRM}) we use the backdoor criterion (Sec. 3.3.1 in \cite{PEARL2009}) to show that $\mathbb{E}[\widehat{\mu}_{i,x}(t)] = \mu_{i,x}$ for every $i,x$. Finally, \CRM\ uses Equations \ref{equation: emprical estimate for arm 0 modified} and \ref{equation: emprical estimate for arm i,x modified} to compute the \UCB\ estimates $\bar{\mu}_{i,x}(t)$ and $\bar{\mu}_0(t)$ of arms $a_{i,x}$ and $a_0$, respectively:
\begin{equation}\label{equation: UCB estimate for arms}
    \bar{\mu}_{i,x}(t) = \widehat{\mu}_{i,x}(t) + \sqrt{\frac{2 \ln t}{N_t^{i,x} + C^{x}_t}}
\end{equation}
\begin{equation}\label{equation: UCB estimate for arm 0}
    \bar{\mu}_0(t) = \widehat{\mu}_0(t) + \sqrt{\frac{2 \ln t}{N_t^0}}
\end{equation}
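To illustrate the effect of $C_t^x$ on the confidence width (hypothetical values), take $t = 50$, $N_t^{i,x} = 3$ and $C_t^x = 2$. The exploration bonus in Equation \ref{equation: UCB estimate for arms} is
\[
\sqrt{\frac{2\ln 50}{3+2}} \approx 1.25, \quad \text{compared with} \quad \sqrt{\frac{2\ln 50}{3}} \approx 1.61
\]
for a standard \UCB\ index that uses only the $3$ interventional pulls.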
We bound the expected cumulative regret of \CRM\ in Thm. \ref{theorem: UB-CRM}, where $\GAU{a^{*}} = \arg\max_{a \in A} \mu_a$ and, for $a\in A$, $\Delta_a = \mu_{a^*} - \mu_a$, $p^{i,x}_{\mathbf{z}} = \mathbb{P}(X_i = x, \mathbf{Pa}(X_i) = \mathbf{z})$, and $p_{i,x} = \min_{\mathbf{z}} p^{i,x}_{\mathbf{z}}$. Additionally, $\eta^{i,x}_T$ denotes the probability that the empirical estimate of $p_{i,x}$ at time $T$ is large (see Observation \ref{obs:eta} in App. \ref{secappendix: proof of CRM}) and is defined as
$\eta^{i,x}_T = \max \big\{0, \big(1 - Z_i T^{-\frac{p_{i,x}^2}{4}}\big)\big\}$,
where $Z_i$ is the size of the domain of $\mathbf{Pa}(X_i)$.
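Rearranging this definition (a direct consequence, stated here only as a reading aid), $\eta^{i,x}_T > 0$ holds exactly when $Z_i T^{-p_{i,x}^2/4} < 1$, that is, when
\[
T > Z_i^{4/p_{i,x}^2}~.
\]
For instance, with the hypothetical values $Z_i = 2$ and $p_{i,x} = 0.25$, this requires $T > 2^{64} \approx 1.8\times 10^{19}$.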

\begin{theorem}\label{theorem: UB-CRM}
If $\GAU{a^{*}} = a_0$, then at the end of $T$ rounds the expected cumulative regret is $O(1)$. Otherwise, the expected cumulative regret is of the order $\frac{58\ln{T}}{\Delta_0} + \Delta_0 + \sum_{\Delta_{i,x}>0} \Delta_{i,x} \max \bigg\{0, 1 + 8\ln{T}\bigg(\frac{1}{\Delta_{i,x}^2} - \frac{p_{i,x} \cdot \eta_T^{i,x}}{36 \Delta_0^2} \bigg) \bigg\} + \sum_{\Delta_a > 0} \Delta_a \frac{\pi^2}{3}$.
\end{theorem}

The proof of Thm. \ref{theorem: UB-CRM} is given in App. \ref{secappendix: proof of CRM}. Notice that the regret guarantee in Thm. \ref{theorem: UB-CRM} is an instance-dependent constant if $a_0$ is optimal, and is otherwise slightly better than that of the \UCB\ family of algorithms.
Also, it is easy to construct examples of CBNs where the observational arm is optimal; see, for example, Experiment $2$ in Sec. \ref{sec: experiments}.

\paragraph{No unobserved variable assumption:} As mentioned, in \CRM\ we work in the fully observable setting, unlike \SRM\ (Sec. \ref{sec: simple regret for general graphs}). A natural question is whether Algorithm \ref{alg: estimating rewards from observations} (App. \ref{secappendix: estimating rewards from observations}) can also be used in \CRM\ in the presence of unobserved confounders. We believe there is no straightforward way to accomplish this, for a rather technical reason. Our estimator (Equation \ref{equation: emprical estimate for arm i,x modified}) cleverly interprets observational samples as $C_t^{x}$ interventional samples and can be shown to be unbiased (Lemma \ref{lemma: unbiased muix}, App. \ref{secappendix: proof of CRM}).
The technique in Algorithm \ref{alg: estimating rewards from observations} does not enable the same interpretation of observational samples, and hence cannot be easily used to create an estimator with similar properties.