\section{Simple Regret Minimization}\label{sec: simple regret for general graphs}
\label{simple-regret}
In this section, we state and analyze our simple regret minimization algorithm, called \SRM, which takes as input an SMCG that is identifiable with respect to the intervenable variables $\mathbf{X}$ (see Sec. \ref{sec: model and prelim} for the definition).
Our proposed algorithm plays the observational arm $a_0$ for the first $T/2$ rounds. Using this observational data, it determines a small set of arms to pull (i.e., interventions to perform) in the next $T/2$ rounds and estimates their rewards from the resulting interventional samples. Finally, for the arms it does not pull, it uses the collected observational samples (from the initial $T/2$ pulls of $a_0$) to estimate their rewards by adapting a procedure from \cite{BhattacharyyaGKMV20}, which efficiently estimates distributions resulting from an atomic intervention using observational samples. We remark that the previous works \cite{LattimoreLR16} and \cite{NairPS21} imposed structural restrictions on the input causal graphs that allowed observational samples to be directly used for estimating rewards corresponding to interventions\footnote{The restrictions ensured that the conditional distributions are equal to the corresponding interventional distributions.}. \SRM, on the other hand, works with more general identifiable SMCGs and still estimates the rewards of multiple arms simultaneously using the observational arm pulls. \SRM\ is presented in Algorithm \ref{SR-algorithm}; we explain each step below.

\begin{algorithm}
\caption{\SRM} \label{SR-algorithm}
\begin{algorithmic}
\State INPUT: Causal Graph $\mathcal{G} = (\mathbf{V},\mathbf{E})$, set of intervenable nodes $\mathbf{X}\subseteq \mathbf{V}$ and time horizon $T$.
\end{algorithmic}
\begin{algorithmic}[1]
\setlength{\lineskip}{5pt}
\State $\mathsf{His} = \{\}$ ~\textcolor{gray}{/* $\mathsf{His}$ would be used to keep the history of sampled values in the first $T/2$ rounds. */}
\For {$t \in [1, \ldots , T/2]$}
    \State Play arm $a_0$ and let $\mathsf{His} =\mathsf{His} \cup  \{\mathbf{V}(t)\setminus \mathbf{U}(t), Y(t)\}$.
\EndFor


\State For each $i\in [n]$, compute $\widehat{q}_{i}$ (as defined in Equation \ref{eqn:q-hat}) and $\widehat{m}$ (as an estimate of $m$) by plugging in $\widehat{q}_{i}$ in place of $q_{i}$ in Equation \ref{eqn:m}. Let $\mathcal{Q} = \{a_{i,x} \in \mathcal{A}: \widehat{q}_{i}^{k_i} < 1/\widehat{m}\}$.
\For {$a_{i,x} \in \mathcal{Q}$}
    \State Play arm $a_{i,x}$ and observe $Y$ for $\frac{T}{2|\mathcal{Q}|}$ rounds.
    \State Estimate reward as $\widehat{\mu}_{i,x} = \frac{2|\mathcal{Q}|}{T} \sum_{t=1}^{T/(2|\mathcal{Q}|)} Y(t)$.
\EndFor

\For {$a_{i,x} \not\in \mathcal{Q}$}
    \State Use Algorithm \ref{alg: estimating rewards from observations} in App. \ref{secappendix: estimating rewards from observations} with inputs $\mathcal{G},\mathsf{His}$ to get the reward estimate $\widehat{\mu}_{i,x}$.
\EndFor


\State Return the estimated optimal arm $a_T^* \in \arg\max_{a \in \mathcal{A}} \widehat{\mu}_a$.
%\EndProcedure
\end{algorithmic}
\end{algorithm}

\emph{Steps 1--4}: In Steps $1$--$3$, \SRM\ collects $T/2$ observational samples from pulls of $a_0$, and in Step $4$ it identifies a set of arms $\mathcal{Q}$ whose reward estimates would be poor if computed from the collected observational samples\footnote{This can be seen using Lemma \ref{mu-estimation-lemma}.}. This is done using a quantity $m(\mathcal{C})$, defined next; the relevant notation is introduced in Sec. \ref{sec: model and prelim}.
Let $q_{i} = \min_{\mathbf{z},x} \mathbb{P}(X_i = x, \mathbf{Pa}^c(X_i) = \mathbf{z})$. For each $\tau \in [2,2N]$, let $I_\tau = \{i : q_{i}^{k_i} < 1/\tau\}$\footnote{Recall from Sec. \ref{sec: model and prelim} that $k_i$ is the size of the c-component of $X_i$.}. We define
\begin{equation}
\label{eqn:m}
    m(\mathcal{C}) = \min \{\tau : |I_\tau| \leq \tau \}.
\end{equation}
The collected observational samples are first used to compute estimates $\widehat{q}_{i}$ of $q_i$, given by
\begin{equation}
\label{eqn:q-hat}
    \widehat{q}_{i} = 
 \Big(\frac{2}{T}\Big)\cdot \min_{\mathbf{z},x} \Big\{ \sum_{t=1}^{T/2} \mathds{1}\{X_{i}(t) = x, \mathbf{Pa}^c(X_i)(t) = \mathbf{z}\} \Big\}.
\end{equation}

These estimates are then plugged into the above definition of $m(\mathcal{C})$ to obtain its estimate $\widehat{m}$. Finally, the set of arms $\mathcal{Q}$ is defined as $\mathcal{Q} = \{a_{i,x} \in \mathcal{A}: \widehat{q}_{i}^{k_i} < 1/\widehat{m}\}$.
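
\emph{Illustration (hypothetical numbers).} As a sanity check of the definitions of $I_\tau$, $m(\mathcal{C})$ and $\mathcal{Q}$, consider an instance, assumed purely for illustration, with $N = 4$ intervenable binary nodes, $k_i = 1$ for all $i$, and estimates $(\widehat{q}_1,\widehat{q}_2,\widehat{q}_3,\widehat{q}_4) = (0.45, 0.40, 0.05, 0.01)$. For example, $\widehat{q}_3 = 0.05$ would arise if, among $T/2 = 1000$ observational samples, the least frequent configuration of $(X_3, \mathbf{Pa}^c(X_3))$ appeared $50$ times. Writing $\widehat{I}_\tau$ for the set $I_\tau$ computed with $\widehat{q}_i$ in place of $q_i$ (a notation used only in this illustration), we get
\[
\widehat{I}_2 = \{1,2,3,4\}, \quad |\widehat{I}_2| = 4 > 2,
\qquad
\widehat{I}_3 = \{3,4\}, \quad |\widehat{I}_3| = 2 \leq 3,
\]
so that $\widehat{m} = 3$ and $\mathcal{Q} = \{a_{3,0}, a_{3,1}, a_{4,0}, a_{4,1}\}$. In this hypothetical instance each of the four arms in $\mathcal{Q}$ would be pulled $T/(2|\mathcal{Q}|) = T/8$ times, while the rewards of $a_{1,x}$ and $a_{2,x}$ would be estimated from the observational samples.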


\emph{Steps 5--10}: Since reward estimates of the arms in $\mathcal{Q}$ computed from observational samples would be poor, in Steps $5$--$7$ we pull each of these arms an equal number of times by performing the corresponding interventions, and estimate their rewards directly from the interventional samples. The observational samples collected in the first $T/2$ rounds are used to compute the estimates for each arm $a_{i,x} \not\in \mathcal{Q}$ in Steps $8$ and $9$, using Algorithm \ref{alg: estimating rewards from observations}, App. \ref{secappendix: estimating rewards from observations}. Finally, in Step $10$, we return the arm with the best reward estimate. Even though Algorithm \ref{alg: estimating rewards from observations} relies on the procedure of \cite{BhattacharyyaGKMV20}, which assumes strong positivity, we do not need to explicitly make this assumption, since low-probability arms $a_{i,x}$ get pulled (by intervention) in Step $6$ and only high-probability arms are estimated using Algorithm \ref{alg: estimating rewards from observations}.
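
\emph{A heuristic calculation for the interventional estimates.} The following sketch is not the argument used in our proofs; it only indicates why an equal split of the budget is natural. It assumes, for illustration, that rewards satisfy $Y \in [0,1]$ and that the samples obtained from repeated pulls of a fixed arm are i.i.d.; the symbols $n$, $\delta$ and $\mu_{i,x}$ (the true expected reward of arm $a_{i,x}$) are introduced only for this sketch. Each arm in $\mathcal{Q}$ is pulled $n = T/(2|\mathcal{Q}|)$ times, so by Hoeffding's inequality, for any $\delta \in (0,1)$, with probability at least $1-\delta$,
\[
|\widehat{\mu}_{i,x} - \mu_{i,x}| \;\leq\; \sqrt{\frac{\log(2/\delta)}{2n}} \;=\; \sqrt{\frac{|\mathcal{Q}| \log(2/\delta)}{T}}.
\]
Hence, when $|\mathcal{Q}|$ is of order $\widehat{m}$, the interventional estimates have error of order $\sqrt{\widehat{m}/T}$ up to logarithmic factors, which is consistent with the rate in Theorem \ref{theorem: UB-SR}.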

\paragraph{Some remarks about $m(\mathcal{C})$:} Our definition of $m(\mathcal{C})$ above is a novel extension of a similar quantity $m$ defined in \cite{LattimoreLR16} and reduces to $\Theta(m)$ for parallel graphs \citep{LattimoreLR16} and no-backdoor graphs \citep{NairPS21}. As a result, the regret guarantee of \SRM\ for these special classes of graphs matches that of the respective algorithms in these works. Operationally, $m(\mathcal{C})$ determines the optimal number of arms to be pulled in Steps $5$--$7$ in order to minimize the expected regret. In particular, $I_{m(\mathcal{C})}$ indexes a set of arms such that the best arm in it (found using $T/2$ rounds of interventions) and the best arm in its complement $I_{m(\mathcal{C})}^c$ (found using $T/2$ rounds of observations) have reward estimates of similar quality.

Theorem \ref{theorem: UB-SR}, stated below, shows that the expected simple regret of \SRM\ is $\tilde{O}(\sqrt{m(\mathcal{C})/T})$. This is an instance-dependent regret guarantee, since $m(\mathcal{C})$ depends on the input CBN. If $m(\mathcal{C}) \ll N$, then \SRM\ performs better than the optimal \texttt{MAB} algorithm. In particular, \SRM\ explores at most $2\widehat{m}+1$ arms, compared to the $2N$ arms that must be explored by a standard best-arm identification \texttt{MAB} algorithm, which incurs $\Omega(\sqrt{N/T})$ expected worst-case simple regret \citep{AudibertBM10}. There are CBNs $\mathcal{C}$ with $m(\mathcal{C}) \ll N$, as illustrated in App. \ref{appendix:example}. The proof of Theorem \ref{theorem: UB-SR} is given in App. \ref{secappendix: proof of SRM}.

\begin{theorem} \label{theorem: UB-SR}
The expected simple regret of \SRM\ at the end of $T$ rounds is $r_{\SRM}(T) = O\bigg(\sqrt{\frac{m(\mathcal{C})}{T}\log \frac{NT}{m(\mathcal{C})}}\bigg)$.
\end{theorem}
\textbf{Remark:} The constant involved in the regret expression is exponential in $\max_i \{k_i\}$ and $\max_i \{|\mathbf{Pa}^c(X_i)|\}$. Recall that these quantities are constants under our assumptions in Sec. \ref{sec: model and prelim}.
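
\emph{Numerical illustration (hypothetical values, constants ignored).} To give a sense of the scale of the bound, suppose, purely for illustration, that $N = 1000$, $m(\mathcal{C}) = 10$ and $T = 10^5$, and take logarithms to be natural. Dropping the hidden constants, which as remarked above may be large, the bound of Theorem \ref{theorem: UB-SR} evaluates to
\[
\sqrt{\frac{m(\mathcal{C})}{T}\log \frac{NT}{m(\mathcal{C})}} = \sqrt{10^{-4}\cdot \log 10^{7}} \approx \sqrt{10^{-4}\cdot 16.12} \approx 0.040,
\qquad \text{whereas} \qquad
\sqrt{\frac{N}{T}} = \sqrt{10^{-2}} = 0.1 .
\]
This comparison is only indicative, since it sets an upper bound against a lower bound and ignores all constant factors.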
