



\section{Experiments}\label{sec: experiments}

In this section, we validate our results empirically. In Experiment $1$, we compare the simple regret of our proposed algorithm \SRM\ to two baseline \MAB\ algorithms: uniform exploration (\UE) and successive rejects (\SR) \citep{AudibertBM10}. In Experiment $2$, we compare \SRM\ with a simple regret minimization algorithm for \CB\ (referred to as \PI, or Propagating Inference, from here onwards), given as Algorithm $3$ in \cite{YabeHSIKFK18}. While implementing \PI\ as described in \cite{YabeHSIKFK18}, we encountered several issues that we had to resolve; details are provided in App. \ref{secappendix:  yabe-et-al-issues}. In Experiment $3$, we compare the simple regret of \SRM\ with the baselines \UE\ and \SR\ as $m$ increases. In Experiment $4$, we compare the expected cumulative regret of \CRM\ and \UCB\ \citep{AuerCF02} when the observational arm is the best arm ($a^{*} = a_0$), validating the first part of Thm. \ref{theorem: UB-CRM}. In Experiment $5$, we compare the expected cumulative regret of \CRM\ and \UCB\ on random CBNs with $a^{*}\neq a_0$, validating the second part of Thm. \ref{theorem: UB-CRM}.

\begin{figure}[!ht]
\centering
  \begin{subfigure}[b]{0.65\columnwidth}
    \includegraphics[width=\linewidth]{figures/regret_T_plot.pdf}
    \caption{{\small \SRM\ vs.\ \UE, \SR}}
         \label{fig:regret-vs-T}
  \end{subfigure}
  \hfill %%
  \begin{subfigure}[b]{0.65\columnwidth}
    \includegraphics[width=\linewidth]{figures/regret_T_pi.pdf}
    \caption{{\small \SRM\ vs.\ \PI}}
        \label{fig:regret-vs-T_pi}
  \end{subfigure}
  \caption{Simple Regret vs. $T$}
\end{figure}


\textbf{Experiment $1$ (Simple Regret vs. $T$, \SRM\ vs. \UE, \SR):} This experiment compares the expected simple regret of \SRM\ with that of \UE\ and \SR\ as $T$ increases. We run the algorithms on $50$ CBNs, where every constructed CBN $\mathcal{C}$ has $100$ intervenable nodes and $m(\mathcal{C})=9$. The CBNs are constructed as follows: a) randomly generate $50$ DAGs on $101$ nodes $X_1, \ldots,X_{100}, Y$ such that $X_1\prec \ldots \prec X_{100}\prec Y$ is a topological order in each such DAG,\ b) $\mathbf{Pa}(X_i)$ contains at most $2$ nodes chosen uniformly at random from $X_1, \ldots, X_{i-1}$, and $\mathbf{Pa}(Y)$ equals the set of all $X_i$s,\ c) $\mathbb{P}(X_i = 1 \mid \mathbf{Pa}(X_i)) = 0.5$ for $i\in [91]$ and $\mathbb{P}(X_i = 1 \mid \mathbf{Pa}(X_i)) = 1/18$ for $i\in [92,100]$,\ d) uniformly at random choose a $j\in \{92,\ldots,100\}$ and set $\mathbb{P}(Y = 1 \mid X_1,\ldots, X_j=1,\ldots, X_{100}) = 0.5 + \epsilon$ and $\mathbb{P}(Y = 1 \mid X_1,\ldots, X_j=0, \ldots, X_{100}) = 0.5 - \epsilon'$, where $\epsilon = 0.3$ and $\epsilon' = q\epsilon/(1-q)$ for $q=1/18$.
Our choice of the conditional distributions in (c) ensures $m(\mathcal{C}) = 9$ for every generated CBN $\mathcal{C}$. 
Our strategy to generate CBNs is a generalization of the one used in \cite{LattimoreLR16}. For each of the $50$ CBNs, we run \SRM, \UE, and \SR\ for multiple values of the time horizon $T$ in $[500, 2500]$ and average the regret over $100$ independent runs. We then average the regret over the $50$ CBNs and plot the mean regret vs. $T$ in Fig. \ref{fig:regret-vs-T}. Since $m \ll N$, \SRM\ has a much lower regret than \UE\ and \SR, in accordance with Thm. \ref{theorem: UB-SR}.
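
As a quick sanity check on this construction, the arm means can be worked out by hand. The calculation below is a sketch under the reading that $q$ denotes $\mathbb{P}(X_j = 1)$ for the chosen node $X_j$ and that the reward depends only on $X_j$; these are assumptions about steps (c) and (d), not additional results. The value of $\epsilon'$ is
\[
\epsilon' = \frac{q\,\epsilon}{1-q} = \frac{(1/18)(0.3)}{17/18} = \frac{0.3}{17} \approx 0.0176,
\]
so the two interventions on $X_j$ have expected rewards
\[
\mathbb{P}(Y=1 \mid do(X_j=1)) = 0.5+\epsilon = 0.8, \qquad \mathbb{P}(Y=1 \mid do(X_j=0)) = 0.5-\epsilon' \approx 0.4824,
\]
and the observational arm has expected reward
\[
\mathbb{P}(Y=1 \mid do()) = q(0.5+\epsilon) + (1-q)(0.5-\epsilon') = 0.5 + q\epsilon - (1-q)\epsilon' = 0.5 .
\]
Since the CPD of $X_j$ in (c) does not vary with its parents, intervening on any other node $X_i$, $i \neq j$, leaves $\mathbb{P}(X_j=1)=q$ and hence also yields expected reward $0.5$. Under these assumptions $do(X_j=1)$ is the unique best arm, and every other arm is suboptimal by at least $\epsilon = 0.3$.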

\textbf{Experiment $2$ (Simple Regret vs. $T$, \SRM\ vs. \PI):} This experiment compares the expected simple regret of \SRM\ with that of \PI\ as $T$ increases. We run the algorithms on $50$ CBNs, where every constructed CBN $\mathcal{C}$ has $10$ intervenable nodes and $m(\mathcal{C})=5$. The CBNs are constructed as follows: a) randomly generate $50$ DAGs on $11$ nodes $X_1, \ldots,X_{10}$ and $Y$, and let $X_1\prec \ldots \prec X_{10}\prec Y$ be the topological order in each such DAG,\ b) $\mathbf{Pa}(X_i)$ contains at most $1$ node chosen uniformly at random from $X_1, \ldots, X_{i-1}$, and $\mathbf{Pa}(Y) = \{X_6, \dots, X_{10}\}$,\ c) $\mathbb{P}(X_i = 1 \mid \mathbf{Pa}(X_i)) = 0.5$ for $i\in [5]$ and $\mathbb{P}(X_i = 1 \mid \mathbf{Pa}(X_i)) = 1/10$ for $i\in [6,10]$,\ d) uniformly at random choose an $X_j$ from $\mathbf{Pa}(Y)$ and set the CPD of $Y$ as $\mathbb{P}(Y = 1 \mid \ldots, X_j=1,\ldots) = 0.5 + \epsilon$ and $\mathbb{P}(Y = 1 \mid \ldots, X_j=0, \ldots) = 0.5 - \epsilon'$, where $\epsilon = 0.3$ and $\epsilon' = q\epsilon/(1-q)$ for $q=1/10$.
The choice of the conditional probability distributions (CPDs) in (c) ensures $m(\mathcal{C}) = 5$ for every CBN $\mathcal{C}$ that is generated. 
Our strategy to generate CBNs is a generalization of the one used in \cite{LattimoreLR16} to define parallel bandit instances with a fixed $m$. For each of the $50$ random CBNs, we run \SRM\ and \PI\ for multiple values of the time horizon $T$ in $[500, 2500]$ and average the regret over $30$ independent runs. We then compute the mean regret over the $50$ random CBNs and plot it against $T$ in Fig. \ref{fig:regret-vs-T_pi}. As seen, \SRM\ has a much lower regret than \PI, which incurs $\tilde{O}(\sqrt{N/T})$ regret, compared with \SRM's regret of $\tilde{O}(\sqrt{m/T})$ (Theorem \ref{theorem: UB-SR}).
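
The claim $m(\mathcal{C})=5$ can be checked by hand if one assumes the parallel-bandit definition of \cite{LattimoreLR16}, namely $m = \min\{\tau : |I_\tau| \leq \tau\}$ with $I_\tau = \{i : \min(q_i, 1-q_i) < 1/\tau\}$ and $q_i = \mathbb{P}(X_i = 1)$. This definition is an assumption here, since the definition of $m(\mathcal{C})$ is not restated in this section; it applies to this instance only because the CPDs in (c) do not vary with the parents. With $q_i = 0.5$ for $i \in [5]$ and $q_i = 1/10$ for $i \in [6,10]$,
\[
|I_5| = |\{6,\ldots,10\}| = 5 \leq 5, \qquad |I_4| = 5 > 4, \quad |I_3| = 5 > 3, \quad |I_2| = 5 > 2, \quad |I_1| = 10 > 1,
\]
so the smallest feasible $\tau$ is $5$. Note that the nodes with $q_i = 0.5$ enter $I_\tau$ only at $\tau = 1$, because $0.5 < 1/\tau$ fails for every $\tau \geq 2$.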

\begin{figure}[!ht]
\centering
  \begin{subfigure}[b]{0.65\columnwidth}
    \includegraphics[width=\linewidth]{figures/regret_m_100.pdf}
        \caption{$N = 100$}
        \label{fig:regret-vs-m-100}
  \end{subfigure}
  \hfill %%
  \begin{subfigure}[b]{0.65\columnwidth}
    \includegraphics[width=\linewidth]{figures/regret_m_200.pdf}
        \caption{$N = 200$}
        \label{fig:regret-vs-m-200}
  \end{subfigure}
  \caption{Simple Regret vs. $m$}
\end{figure}

\textbf{Experiment $3$ (Simple Regret vs. $m$):} This experiment compares the expected simple regret of \SRM\ with that of \UE\ and \SR\ for CBNs with different values of $m$ from the set $M = \{10+2k : k\in [20]\}$. For this experiment, we fix the time horizon to $T=1600$. We randomly generate $35$ DAGs on $N+1$ nodes $X_1, \ldots,X_{N}$ and $Y$. For each generated DAG $\mathcal{G}$ and each $m \in M$, we use the same process as in Exp. $1$ to set the CPDs of $\mathcal{G}$. For each of the $35$ random CBNs thus obtained, we run \SRM, \UE, and \SR\ for time horizon $T$ and average the regret over $100$ independent runs. We repeat this experiment for $N=100$ and $N=200$. For $N=100$, we plot the mean regret over all $35$ random CBNs vs. $m$ in Fig. \ref{fig:regret-vs-m-100}. The same plot for $N=200$ is provided in Fig. \ref{fig:regret-vs-m-200}. Our plots validate the $\sqrt{m}$ dependence of the regret (for fixed $T$) in the case of \SRM. We see that as $N$ increases (with $m$ fixed), the regret of \SRM\ remains constant (as shown in Theorem \ref{theorem: UB-SR}), whereas the regret of \UE\ and \SR\ increases (as indicated by their regret guarantees). Thus, for large $N$, \SRM\ is strictly better for a wide range of values of $m$.
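
For concreteness, and assuming $[20]$ denotes $\{1,\ldots,20\}$ (consistent with the use of $[91]$ in Exp. $1$), the set of values is
\[
M = \{12, 14, 16, \ldots, 50\}, \qquad |M| = 20 .
\]
If the regret of \SRM\ scales as $\sqrt{m/T}$ up to logarithmic factors, as in the rate quoted for Theorem \ref{theorem: UB-SR}, then at fixed $T = 1600$ the predicted ratio between the two ends of this range is
\[
\sqrt{\frac{50}{12}} \approx 2.04 ,
\]
that is, roughly a doubling of regret from $m=12$ to $m=50$. This is only a back-of-the-envelope reading of the bound that ignores constants and logarithmic factors; it is not a value measured from the plots.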

\begin{figure}[!ht]
\centering
  \begin{subfigure}[b]{0.65\columnwidth}
    \includegraphics[width=\linewidth]{figures/cumulative_regret_vs_T.pdf}
\caption{{\small $a^{*}=a_0$}} 
\label{fig:cumulative-regret-vs-T}
  \end{subfigure}
  \hfill %%
  \begin{subfigure}[b]{0.65\columnwidth}
     \includegraphics[width=\linewidth]{figures/cumulative_regret_vs_T_any.pdf}
        \caption[]{{\small $a^{*}\neq a_0$}} 	\label{fig:regret-vs-T-CRM-any-arm} 
  \end{subfigure}
  \caption{Cumulative Regret vs. $T$}
\end{figure}


\textbf{Experiment $4$ (Cumulative Regret vs.\ $T$, $a^*=a_0$):} In this experiment, we compare the cumulative regret of \CRM\ with that of \UCB\ for a CBN on four nodes $X_1, X_2, X_3$, and $Y$. $X_1$ has no parents and is the only parent of $X_2$ and $X_3$. The parents of $Y$ are $X_2$ and $X_3$. We choose the following CPDs: $\mathbb{P}(X_1=1) = 0.5$, $\mathbb{P}(X_2=1 \mid X_1)$ and $\mathbb{P}(X_3=1 \mid X_1)$ are both equal to $0.75X_1 + 0.25(1-X_1)$, and $\mathbb{P}(Y=1 \mid X_2, X_3) = \mathds{1}_{X_2=X_3}$. For this instance, it is easy to see that $\mathbb{P}(Y=1 \mid do(X_2=x)) = \mathbb{P}(Y=1 \mid do(X_3=x)) = 0.5$ for $x\in \{0,1\}$ and $\mathbb{P}(Y=1 \mid do()) = 5/8$, implying that the observational arm is the best arm. We average the cumulative regrets of \CRM\ and \UCB\ over $30$ independent runs. Fig. \ref{fig:cumulative-regret-vs-T} demonstrates that the cumulative regret of \UCB\ keeps increasing, whereas that of \CRM\ becomes constant for large $T$ (as shown in Thm. \ref{theorem: UB-CRM}).
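
The two values quoted for this instance can be verified directly from the CPDs. Write $p_{x_1} = \mathbb{P}(X_2=1 \mid X_1=x_1) = \mathbb{P}(X_3=1 \mid X_1=x_1)$ (a shorthand introduced only for this check), so that $p_1 = 0.75$ and $p_0 = 0.25$. Since $X_2$ and $X_3$ are conditionally independent given $X_1$, the probability that they agree is
\[
\mathbb{P}(X_2 = X_3 \mid X_1 = x_1) = p_{x_1}^2 + (1-p_{x_1})^2 = 0.5625 + 0.0625 = \frac{5}{8}
\]
for both $x_1 = 0$ and $x_1 = 1$, and therefore
\[
\mathbb{P}(Y=1 \mid do()) = \sum_{x_1 \in \{0,1\}} \mathbb{P}(X_1 = x_1)\,\mathbb{P}(X_2 = X_3 \mid X_1 = x_1) = \frac{5}{8} .
\]
Under $do(X_2 = x)$ the edge from $X_1$ to $X_2$ is removed, so $Y=1$ exactly when $X_3 = x$, and
\[
\mathbb{P}(Y=1 \mid do(X_2=x)) = \mathbb{P}(X_3 = x) = 0.5\,(0.75) + 0.5\,(0.25) = 0.5 ,
\]
with the same value for $do(X_3=x)$ by symmetry. The observational arm thus exceeds each of these interventional arms by $5/8 - 1/2 = 1/8$.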


\textbf{Experiment $5$ (Cumulative Regret vs. $T$, $a^*\neq a_0$):} This experiment compares the cumulative regret of \CRM\ with that of \UCB\ as $T$ increases. The algorithms are run on $12$ CBNs, each with $10$ intervenable nodes. The CBNs are constructed as follows: a) randomly generate $12$ DAGs on $11$ nodes $X_1, \ldots,X_{10}$ and $Y$, and let $X_1\prec \ldots \prec X_{10}\prec Y$ be the topological order in each such DAG,\ b) $\mathbf{Pa}(X_i)$ contains at most $1$ node chosen uniformly at random from $X_1, \ldots, X_{i-1}$, and $\mathbf{Pa}(Y)$ contains $X_i$ for all $i$,\ c) $\mathbb{P}(X_i = 1 \mid \mathbf{Pa}(X_i)) = 0.5$ for $i\in [10]$, and\ d) uniformly at random choose an $X_j$ from $\mathbf{Pa}(Y)$ and set the CPD of $Y$ as $\mathbb{P}(Y = 1 \mid \ldots, X_j=1,\ldots) = 0.5 + \epsilon$ and $\mathbb{P}(Y = 1 \mid \ldots, X_j=0, \ldots) = 0.5 - \epsilon'$, where $\epsilon = 0.1$ and $\epsilon' = q\epsilon/(1-q)$ for $q=1/2$; that is, an interventional arm is the best arm. We average the cumulative regrets of \CRM\ and \UCB\ over $30$ independent runs. Fig. \ref{fig:regret-vs-T-CRM-any-arm} demonstrates that the cumulative regret of \CRM\ becomes lower than that of \UCB\ for large $T$ (as shown in Thm. \ref{theorem: UB-CRM}).
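
To make the arm means in this instance explicit: with $q = 1/2$, the construction in (d) gives $\epsilon' = q\epsilon/(1-q) = \epsilon = 0.1$. Under the reading that $q = \mathbb{P}(X_j=1)$ and that the reward depends only on $X_j$ (assumptions about the construction, not additional results), the expected rewards are
\[
\mathbb{P}(Y=1 \mid do(X_j=1)) = 0.6, \qquad \mathbb{P}(Y=1 \mid do(X_j=0)) = 0.4,
\]
\[
\mathbb{P}(Y=1 \mid do()) = 0.5\,(0.6) + 0.5\,(0.4) = 0.5 ,
\]
and any intervention on a node other than $X_j$ also yields $0.5$, because the CPD of $X_j$ in (c) does not depend on its parents. The best arm $do(X_j=1)$ therefore beats the observational arm $a_0$ by $\epsilon = 0.1$, which is the sense in which $a^{*} \neq a_0$ in this experiment.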


