\section{Notations and Preliminaries}\label{sec: model and prelim}
\paragraph{Notations:} For positive integers $m,n$ with $m<n$, $[n]$ denotes the set $\{1,\ldots,n\}$ and $[m,n]$ denotes the set $\{m,m+1,\ldots,n\}$. Random variables are denoted by capital letters (e.g. $X$), and the corresponding lower case letters (e.g. $x$) denote the assignment $X=x$. Unless otherwise specified, all random variables are discrete with finite support. Sets of random variables are denoted by bold face letters (e.g. $\mathbf{X}$), and the corresponding bold face lower case letters (e.g. $\mathbf{x}$) denote the assignment $\mathbf{X} = \mathbf{x}$. We use $\mathbb{P}(\mathbf{X}=\mathbf{x})$ (equivalently $\mathbb{P}(\mathbf{x})$) to denote the probability of $\mathbf{X}$ taking the value $\mathbf{x}$. The conditional probability of $\mathbf{X}=\mathbf{x}$ given $\mathbf{Y} = \mathbf{y}$ is denoted by $\mathbb{P}(\mathbf{x}\mid \mathbf{y})$. The size of any set $S$ is denoted by $|S|$.
\paragraph{Causal Bayesian Network:}
A Bayesian Network or BN is a tuple $(\mathcal{G},\mathbb{P})$, where $\mathcal{G} = (\mathbf{V},\mathbf{E})$ is a directed acyclic graph (DAG), and $\mathbf{V} = \{V_1,\ldots,V_n\}$ and $\mathbf{E}$ are the sets of nodes and edges of $\mathcal{G}$ respectively. A node $V_i$ is called a parent of $V_j$, and $V_j$ a child of $V_i$, if there is a directed edge from $V_i$ to $V_j$ in $\mathbf{E}$. The nodes in $\mathbf{V}$ are labelled by random variables, and $\mathbb{P}$ is the joint distribution of $\mathbf{V}$ that factorizes over $\mathcal{G}$, i.e. $\mathbb{P}(\mathbf{V}) = \prod_{i=1}^n \mathbb{P}(V_i\mid \mathbf{Pa}(V_i))$,
where $\mathbf{Pa}(V_i)$ is the set of parents of $V_i$. Sometimes certain nodes of a BN are not observable; these are termed unobserved variables. In this situation, for each node $V_i\in \mathbf{V}$, $\mathbf{Pa}(V_i)$ denotes the set of \emph{observable} parents of $V_i$. A \textbf{Causal Bayesian Network} or CBN is a BN where each edge denotes an immediate causal relationship. The graph $\mathcal{G}$ (called the \textbf{Causal Graph}) corresponding to a CBN describes the data generation process not just of the observational distribution $\mathbb{P}$ but also of the interventional distributions that can be derived from it.
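
As a concrete instance of the factorization above, consider a hypothetical three-node chain (chosen only for illustration, and not one of the graphs studied in this paper) with $\mathbf{V} = \{V_1,V_2,V_3\}$ and edges $V_1\to V_2$ and $V_2 \to V_3$, so that $\mathbf{Pa}(V_1)=\emptyset$, $\mathbf{Pa}(V_2)=\{V_1\}$ and $\mathbf{Pa}(V_3)=\{V_2\}$. The factorization then reads
\[
\mathbb{P}(v_1,v_2,v_3) = \mathbb{P}(v_1)\,\mathbb{P}(v_2\mid v_1)\,\mathbb{P}(v_3\mid v_2).
\]
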
An intervention on an observable node $X \in \mathbf{V}$ is denoted by $do(X=x)$: $X$ is set to the value $x$ and all edges incoming to $X$ are removed. The resulting graph defines a probability distribution $\mathbb{P}(\mathbf{V}\setminus \{X\} \mid do(X=x))$ over $\mathbf{V}\setminus \{X\}$, called an interventional distribution. In the presence of unobserved variables it is convenient to assume (without loss of generality) \citep{TianP02, Verma1988} that the underlying graph $\mathcal{G}$ of the CBN is semi-Markovian. Formally, a \textbf{Semi-Markovian Causal Graph} or SMCG is a DAG in which every unobserved variable is a root and has exactly two observable children \citep{TianP02, Acharya2018}. These unobserved variables are called \textbf{Unobserved Confounders} or UCs in the rest of the paper. It is convenient to represent SMCGs with observable vertices only, by adding a bi-directed edge between two observable vertices if they have a common unobserved parent and removing the unobserved parent from the graph \citep{TianP02}. Such graphs thus comprise both directed and bi-directed edges, with all vertices $\mathbf{V} = \{V_1,\ldots,V_n\}$\footnote{By abuse of notation we denote the set of observable vertices by $\mathbf{V}$ from here onwards.} observable. The bi-directed edges can be used to partition the observable vertices into what are called \textbf{c-components}\footnote{If a node is not incident to any bi-directed edge then its c-component consists of the node itself.} \citep{TianP02}. Two observable vertices are said to be in the same c-component if and only if they are connected by a path of bi-directed edges. Let $\mathbf{X} = \{X_1,\ldots,X_N\}\subset \mathbf{V}$ denote the set of intervenable nodes. The c-component containing the node $X_i$ is denoted by $S_i$ and its size is denoted by $k_i$, i.e. $k_i = |S_i|$. The number of observable parents of $X_i$ is denoted by $d_i$, i.e. $d_i = |\mathbf{Pa}(X_i)|$.
We define $\mathbf{Pa}^{+}(S_i) = S_i \cup \bigcup_{V \in S_i} \mathbf{Pa}(V)$, and $\mathbf{Pa}^c(X_i) = \mathbf{Pa}^{+}(S_i)\setminus \{X_i\}$. For more details on these definitions we refer the reader to \cite{TianP02, Verma1988, Acharya2018}. An important question that arises in the context of SMCGs is that of \textbf{identifiability}, which asks whether the interventional distributions $\mathbb{P}(\mathbf{V}\setminus \{X\} \mid do(X=x))$ can be estimated consistently using observational data sampled from $\mathbb{P}(\mathbf{V})$. In the case of atomic interventions, \cite{Tian2002b} provided a necessary and sufficient condition for this to happen. They show (Thm. $3$ in \cite{Tian2002b}) that $\mathbb{P}(\mathbf{V}\setminus \{X\} \mid do(X=x))$ is identifiable if and only if there is no bi-directed path connecting $X$ to any of its children. For this work, we say that an ``SMCG is identifiable with respect to variables in $\mathbf{X}$'' if the identifiability condition mentioned above holds for all intervenable nodes in $\mathbf{X}$. Note that, when all variables are observable, there are no bi-directed paths and the interventional probabilities are always identifiable. Moreover, in the observable setting one can use the \emph{backdoor criterion} \citep{PEARL2009} to estimate the interventional probabilities from observational samples. In the general setting, \cite{BhattacharyyaGKMV20} provide an efficient procedure (based on the construction in \cite{Tian2002b}) to estimate the interventional distribution using observational samples, which we use in Sec. \ref{sec: simple regret for general graphs}.
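
To illustrate the definitions above, consider a hypothetical SMCG (chosen only for this example) with observable vertices $\mathbf{V}=\{X_1,Z,Y\}$, directed edges $X_1\to Z$ and $Z\to Y$, and a single bi-directed edge between $X_1$ and $Y$. Its c-components are $\{X_1,Y\}$ and $\{Z\}$, so $S_1=\{X_1,Y\}$ and $k_1=2$. Since $\mathbf{Pa}(X_1)=\emptyset$ and $\mathbf{Pa}(Y)=\{Z\}$, we have $d_1=0$ and
\[
\mathbf{Pa}^{+}(S_1) = \{X_1,Y\}\cup\{Z\} = \{X_1,Z,Y\}, \qquad \mathbf{Pa}^c(X_1) = \mathbf{Pa}^{+}(S_1)\setminus\{X_1\} = \{Z,Y\}.
\]
The only child of $X_1$ is $Z$, which lies in a different c-component, so no bi-directed path connects $X_1$ to any of its children and the identifiability condition holds for $X_1$. If the bi-directed edge instead joined $X_1$ and $Z$, the condition would fail and $\mathbb{P}(\mathbf{V}\setminus\{X_1\}\mid do(X_1=x))$ would not be identifiable.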

\paragraph{Causal Bandits:}
A causal bandit algorithm receives as input a causal graph $\mathcal{G} = (\mathbf{V}, \mathbf{E})$ (corresponding to some CBN $\mathcal{C}$), the associated set of (binary) intervenable nodes $\mathbf{X} \subseteq \mathbf{V}$ and the designated (observable) reward node $Y\in \mathbf{V}$. We assume there are $N$ intervenable nodes $\mathbf{X} = \{X_1, \ldots, X_N\}$, and hence $2N$ interventions, denoted $a_{i,x} = do(X_i =x)$ for $i\in [N]$ and $x\in \{0,1\}$. The empty intervention $do()$, which corresponds to the observational distribution, is denoted by $a_0$. These $2N+1$ interventions constitute the arms $\mathcal{A}= \{a_{i,x} \mid i \in [N], x\in \{0,1\}\} \cup \{a_0\}$ of the bandit instance. A causal bandit algorithm is a sequential decision-making process that, at each time $t$, makes an intervention $a_t \in \mathcal{A}$ and observes the sampled values of the nodes in $\mathbf{V}$, including the value of the node $Y$. The values of nodes $V \in \mathbf{V}$, $X \in \mathbf{X}$ and $Y$ sampled at time $t$ are denoted by $V(t), X(t)$, and $Y(t)$ respectively. Throughout the paper we use $i, x$, and $a$ to index the sets $[N]$, $\{0,1\}$, and $\mathcal{A}$ respectively. The expected rewards corresponding to the interventions $a_{i,x}\in \mathcal{A}$ and $a_0 \in \mathcal{A}$ are denoted by $\mu_{i,x} = \mathbb{E}[Y\mid do(X_i =x)]$ and $\mu_0 = \mathbb{E}[Y]$ respectively. We study the causal bandit problem with respect to two standard objectives in the bandit literature: simple and cumulative regret.

\textbf{Simple Regret}: The expected simple regret of an algorithm $\texttt{ALG}$ that outputs arm $a_T$ at the end of $T$ rounds is defined as $r_{\texttt{ALG}}(T) = \max_{a\in \mathcal{A}} \mu_a - \mu_{a_T}$.

\textbf{Cumulative Regret}: Let \texttt{ALG} be an algorithm that plays arm $a_t$ at time $t\in [T]$. Then the expected cumulative regret of $\texttt{ALG}$ at the end of $T$ rounds is defined as $R_{\texttt{ALG}}(T) = \max_{a\in \mathcal{A}} \mu_a\cdot T - \sum_{t\in [T]}\mu_{a_t}$.
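
For a concrete instance of both definitions, consider a hypothetical bandit instance with $N=1$, so that $\mathcal{A}=\{a_0,a_{1,0},a_{1,1}\}$, and suppose (illustrative values only) that $\mu_0=0.5$, $\mu_{1,0}=0.3$ and $\mu_{1,1}=0.7$, so the best arm is $a_{1,1}$. If $\texttt{ALG}$ outputs $a_T=a_0$, then $r_{\texttt{ALG}}(T)=0.7-0.5=0.2$. If $\texttt{ALG}$ plays $a_0$, $a_{1,0}$ and $a_{1,1}$ in rounds $t=1,2,3$ respectively, then
\[
R_{\texttt{ALG}}(3) = 3\cdot 0.7 - (0.5+0.3+0.7) = 0.6.
\]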


Throughout this paper, we assume that the intervenable nodes are binary, i.e. the distribution of any intervenable node $X_i$ conditioned on its parents $\mathbf{Pa}(X_i)$ is Bernoulli. We assume without loss of generality that $X_i \prec Y$, where $\prec$ is a topological order on $\mathcal{G}$. In \SRM\ (Sec. \ref{sec: simple regret for general graphs}) we assume that the input is an SMCG that is identifiable with respect to the intervenable variables $\mathbf{X}$, and in \CRM\ (Sec. \ref{sec: cumulative regret}) we assume that the input causal graph has no UCs and that the underlying distribution $\mathbb{P}$ is strictly positive, i.e. $\mathbb{P}(\mathbf{v})>0$ for all $\mathbf{v}$.


Other than the above, our algorithms make no structural assumptions and are therefore significantly more general than those in previous works \citep{LattimoreLR16, LU2020, NairPS21, Lu2021}. We note, however, that the results in our main theorems, i.e. Thms. \ref{theorem: UB-SR}, \ref{theorem: LB-Tree}, \ref{theorem: LB-given-q}, \ref{theorem: UB-CRM}, are stated assuming that all the c-components have bounded size, implying that $k_i=O(1)$ for all $i\in [N]$, and that the total number of observable parents of nodes in the c-component $S_i$, i.e. $|\cup_{V\in S_i} \mathbf{Pa}(V)|$, is also bounded above by a constant. This implies that the indegree $d_i = |\mathbf{Pa}(X_i)|$ is also $O(1)$. These assumptions let us state the results more cleanly, highlighting the main parameters of importance for this work; our algorithms work even without them. We note that these assumptions are common in the causal inference literature \citep{Acharya2018, BhattacharyyaGKMV20}.