
%cameraready: from UAI v1


\documentclass[accepted]{uai2024} % for initial submission
%\documentclass[accepted]{uai2024} % after acceptance, for a revised version; 
% also before submission to see how the non-anonymous paper would look like 
                        
%% There is a class option to choose the math font
% \documentclass[mathfont=ptmx]{uai2024} % ptmx math instead of Computer
                                         % Modern (has noticeable issues)
% \documentclass[mathfont=newtx]{uai2024} % newtx fonts (improves upon
                                          % ptmx; less tested, no support)
% NOTE: Only keep *one* line above as appropriate, as it will be replaced
%       automatically for papers to be published. Do not make any other
%       change above this note for an accepted version.

%% Choose your variant of English; be consistent
\usepackage[american]{babel}
% \usepackage[british]{babel}

%% Some suggested packages, as needed:
\usepackage{natbib} % has a nice set of citation styles and commands
    \bibliographystyle{plainnat}
    \renewcommand{\bibsection}{\subsubsection*{References}}
\usepackage{mathtools} 
\usepackage{amsmath}
\usepackage{amssymb}
\usepackage{amsthm}
\usepackage{amsbsy}
\usepackage{bbm}
\usepackage{dsfont}
\usepackage{array}
\usepackage{nicefrac}

\usepackage{subcaption}

\usepackage{algorithmic}
%\usepackage{algpseudocode}
\usepackage{algorithm}

% amsmath with fixes and additions
% \usepackage{siunitx} % for proper typesetting of numbers and units
\usepackage{booktabs} % commands to create good-looking tables
%\usepackage{tikz} % nice language for creating drawings and diagrams

%% Provided macros
% \smaller: Because the class footnote size is essentially LaTeX's \small,
%           redefining \footnotesize, we provide the original \footnotesize
%           using this macro.
%           (Use only sparingly, e.g., in drawings, as it is quite small.)

\usepackage{tikz}
\usepackage{tikz-cd}
\usetikzlibrary{arrows}
\usetikzlibrary{shapes.geometric}
\tikzstyle{blacknode} = [circle,draw=black]
\tikzstyle{blacksquare} = [draw=black]
\definecolor{ao}{rgb}{0.0, 0.5, 0.0}

%\usepackage{algorithmicx}
%\usepackage{algpseudocode}
%\usepackage{algorithm}

\input{macros}

\theoremstyle{plain}
\newtheorem{theorem}{Theorem}[section]
\newtheorem{proposition}[theorem]{Proposition}
\newtheorem{lemma}[theorem]{Lemma}
\newtheorem{corollary}[theorem]{Corollary}
\theoremstyle{definition}
\newtheorem{definition}[theorem]{Definition}
\newtheorem{assumption}[theorem]{Assumption}
\theoremstyle{remark}
\newtheorem{remark}[theorem]{Remark}
\newtheorem{example}[theorem]{Example}


%% Self-defined macros
\newcommand{\swap}[3][-]{#3#1#2} % just an example
\newcommand{\fmz}[1]{#1}

\title{Causally Abstracted Multi-armed Bandits}

% The standard author block has changed for UAI 2024 to provide
% more space for long author lists and allow for complex affiliations
%
% All author information is automatically removed by the class for the
% anonymous submission version of your paper, so you can already add your
% information below.
%
% Add authors
\author[1]{\href{mailto:<fabio.zennaro@uib.no>?Subject=Your UAI 2024 paper}{Fabio Massimo Zennaro}{}}
\author[2]{Nicholas Bishop}
\author[2]{Joel Dyer}
\author[3]{Yorgos Felekis}
\author[2]{Anisoara Calinescu}
\author[2]{Michael Wooldridge}
\author[3]{Theodoros Damoulas}
% Add affiliations after the authors
\affil[1]{%
    University of Bergen
}
\affil[2]{%
    University of Oxford
}
\affil[3]{%
    University of Warwick
  }
  
  \begin{document}
\maketitle

\begin{abstract}
  
  Multi-armed bandits (MABs) and causal MABs (CMABs) are established frameworks for decision-making problems. Prior work typically studies and solves individual MABs and CMABs in isolation for a given problem and associated data. However, decision-makers are often faced with multiple related problems and multi-scale observations where joint formulations are needed in order to efficiently exploit the problem structures and data dependencies. Transfer learning for CMABs addresses the situation where models are defined on \textit{identical} variables, although causal connections may differ. In this work, we extend transfer learning to setups involving CMABs defined on potentially different variables, with varying degrees of granularity, and related via an abstraction map. Formally, we introduce the problem of causally abstracted MABs (CAMABs) by relying on the theory of causal abstraction in order to express a rigorous abstraction map. We propose algorithms to learn in a CAMAB, and study their regret. We illustrate the limitations and the strengths of our algorithms on a real-world scenario related to online advertising.
\end{abstract}

\section{Introduction}

\emph{Multi-armed bandit problems} (MABs) provide an established formalism to model real-world decision-making problems where an agent is required to repeatedly take an action whilst managing a trade-off between exploiting current knowledge or exploring new alternatives \citep{lattimore2020bandit}. The notion of causality has been explicitly introduced in the MAB formulation by modelling the system upon which an agent may act as a \emph{structural causal model} (SCM) \citep{pearl2009causality}. This has led to the definition of \emph{causal MABs} (CMABs), where the actions of an agent are identified with causal interventions, and the relation between an action and the outcome of the system is mediated by the SCM \citep{bareinboim2015bandits,lattimore2016causal}. 

Although MABs are often solved in isolation, growing interest has focused on the problem of transferring learned information across MABs, in order to more efficiently solve decision-making problems in different environments \citep{lazaric2012transfer}. Current methods for CMABs exploit common variables and structures between a source model and a target model, in order to transfer information \citep{zhang2017transfer}. 
In this work, we study the possibility of transferring information between CMABs that may be defined over related, but different, variables. Specifically, we consider the case where a decision-making problem is modelled through two CMABs at different levels of resolution providing two representations of the same system. This setup mirrors common real-world scenarios where observations and experiments for critical decision-making have been carried out at different scales. For instance, consider the common scenario of optimal advertisement placement: an online company may deploy different protocols to collect click-through data about its customers. These protocols may differ in the variables that are observed due to different technical constraints, legal requirements or management decisions at the time of collection. While data gathered from each protocol may be used to solve isolated CMABs, if the underlying models could be formally related to each other, information may be transferred from one CMAB to the other, thus improving the learning process. 

In order to transfer information across CMABs defined over different variables, we rely on the theory of causal abstraction (CA) to relate the causal structure of the CMABs \citep{rischel2020category,zennaro2023quantifying}. We thus define the causally abstracted MABs (CAMAB) problem, that is, the problem of learning across CMABs by exploiting a known abstraction map. Whereas current approaches to causal transfer require models to be defined on identical variables, our approach overcomes this limitation, and allows us to transfer samples and aggregate statistics from a low-level model to a high-level model in different settings. We propose algorithms for learning in CAMABs and analyze our results in terms of simple and cumulative regret. 



\paragraph{Contributions.} 
This paper introduces a framework based on CA for transferring information between CMABs \fmz{which moves beyond previous works connecting transfer learning and causal inference under the assumption of a fixed graphical structure \citep{rojastransfer,Pearl_Bareinboim_2011}}. We define the CAMAB problem (Sec. \ref{sec:CAMAB}), present a customized measure for the quality of abstraction (Sec. \ref{sec:Measuring}), and discuss a taxonomy of CAMAB settings (Sec. \ref{sec:taxonomy}). We then study the most representative CAMAB settings: we first derive negative results for an intuitive algorithm transporting the optimal action of the base CMAB (Sec. \ref{ssec:Scenario1}) and for an off-policy algorithm transporting the actions taken in the base CMAB (Sec. \ref{ssec:Scenario2}); then, we propose and analyze an algorithm based on the transfer of the expected values of the rewards and their upper bounds from the base CMAB (Sec. \ref{ssec:Scenario3}). We thus provide a broad overview of different approaches within our framework, and, at the end, we showcase our algorithms on a realistic problem (Sec. \ref{sec:Experiments}).










\section{Background}
%\textbf{Currently rewriting this section!}
In this section we review the definition of SCMs, integrate them into MABs to define CMABs, and finally formalize the notion of CA between SCMs. We refer the reader to App. \ref{app:Assumptions} for a discussion of the assumptions in our models.


\paragraph{Causal Models.}

An SCM encodes a causal system defined over a set of variables as follows: 

\begin{definition}[SCM \citep{pearl2009causality}]
    A structural causal model (SCM) is a tuple $\scm = \langle \envars,\exvars,\structfuncs, \prob(\exvars) \rangle$ where:
    \begin{itemize}
        \item $\envars = \{X_1,...,X_n\}$ is a set of $n$ endogenous variables, each defined on domain $\domain[X_i]$;
        \item $\exvars = \{U_1,...,U_n\}$ is a set of $n$ exogenous variables, each defined on domain $\domain[U_i]$;
        \item $\structfuncs = \{f_1,...,f_n\}$ is a set of $n$ measurable functions, one for each endogenous variable $X_i$; each function is defined as $f_i: \domain[\envars] \setminus \domain[X_i] \times \domain[U_i] \rightarrow \domain[X_i]$;
        \item $\prob(\exvars)$ is a joint probability distribution over the exogenous variables $\exvars$.% factorizing as $\prod_i \prob(U_i)$.
    \end{itemize}
    An SCM $\scm$ admits an underlying directed acyclic graph (DAG) $\dirag_\scm = \langle V,E \rangle$, where $V$ is the set of vertices $V = \envars \cup\, \exvars$ and $E$ is the set of edges $E= \{ (V_i,V_j) \vert V_j \in \envars, V_i \in \envars \cup\, \exvars \textrm{ and } V_i \textrm{ in the domain of } f_j\}$.
\end{definition}

A decision-maker may act on an SCM through interventions: 
\begin{definition}[Intervention \citep{pearl2009causality}]
    Given an SCM $\scm$, an intervention $\dointv(\mathbf{X} = \mathbf{x})$, shortened to $\dointv(\mathbf{x})$ when clear from the context, is an operator which, for every $X_i \in \mathbf{X} \subseteq \envars$, replaces the structural function $f_i$ in $\scm$ with the constant $x_i \in \domain[X_i]$.
\end{definition}


Note that an intervention $\dointv(\mathbf{x})$ on $\scm$ mutilates the graph $\dirag_\scm$ by removing all edges incoming to each variable $X \in \mathbf{X}$ and induces a new probability distribution $\prob(\envars \mid \dointv(\mathbf{X}= \mathbf{x}))$ over endogenous variables. 

\paragraph{Causal MABs.}
We now combine SCMs with MABs following \cite{lattimore2016causal}. 
A CMAB $\bandit$ is defined by an SCM $\scm$ with a predesignated reward variable $Y \in \mathcal{X}$, and a set $\actionset$ of actions $\action_i$ which we equate to a set $\intervset$ of interventions $\dointv(\mathbf{x}_i)$ on $\scm$. Similarly, we take rewards $\rewardval$ generated by the reward process $\rewarddistr{\action_i}$ associated with action $\action_i$ to correspond to samples $y$ from $\prob(Y \vert \dointv(\mathbf{x}_i))$. We will assume that $\actionset$ is finite and that the reward values lie in the interval $[0, 1]$ almost surely.
We denote the expected reward for action $\action$ by $\mu_{\action} = \expval{Y\vert \action}{Y}$; we also denote the optimal action as $\optaction=\argmax_{\action \in \actionset}\mu_{\action}$ and the optimal expected reward as $\mu^{*} = \max_{\action \in \actionset}\mu_{\action}$.
See Tab. \ref{tab:langageequivalence} for an equivalence between SCM and CMAB quantities.
%In a causal bandit, a learner repeatedly intervenes on an SCM $\mathcal{M}$ with a predesignated reward variable $Y \in \mathcal{X}$. Initially, the learner is provided with the induced graph $\mathcal{G}$ corresponding to $\mathcal{M}$ and a set of permissible interventions (or actions) $\mathcal{A}$. 
A learning agent interacts with $\scm$ over $T$ rounds; in each round $t$, it selects and performs action $\actiont{t}$, and collects reward $y^{(t)}$. 
%At the beginning of each round, the learner selects and applies an intervention $a_{t} \in \mathcal{A}$ and observes endogenous values drawn from $\mathbb{P}(\cdot \mid a_{t})$ including a reward $Y_{t}$. 
The goal of the agent is to adopt an intervention policy $\policyt{t}(\action)=\prob(\action)$ that minimises a regret-based performance benchmark. We consider two different notions of regret, the first of which is \emph{simple regret}:
\begin{equation}
    \sregr(T) = \mu^{*} - \expval{\policyt{T}}{\mu_{\actiont{T}}},
\end{equation}
where $\actiont{T}$ is an arm chosen by the learner after the time horizon has concluded. MABs evaluated under simple regret are often referred to as pure exploration problems, as the goal of the agent is simply to identify the best arm over the course of the time horizon, rather than play it repeatedly. The second benchmark we consider is \emph{cumulative regret}:
\begin{equation}
    \cregr(T) = T\mu^{*} - \sum^{T}_{t=1}\expval{\policyt{t}}{\mu_{\actiont{t}}},
\end{equation}
wherein the goal of the agent is to maximise the expected reward accumulated over the time horizon w.r.t. the learned policy. \fmz{A standard algorithm to solve MAB problems is offered by the \ucb{} algorithm \citep{lattimore2020bandit}}.
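
As a hypothetical illustration of the two notions of regret (the numbers are assumptions chosen purely for exposition, not results), consider a CMAB with two actions such that $\mu_{\action_1} = 0.7$ and $\mu_{\action_2} = 0.5$, so that $\mu^{*} = 0.7$. A policy that selects either action with probability $1/2$ in every round, including the final recommendation, has expected per-round reward $0.5 \cdot 0.7 + 0.5 \cdot 0.5 = 0.6$, and therefore
\begin{equation*}
    \sregr(T) = 0.7 - 0.6 = 0.1,
    \qquad
    \cregr(T) = 0.7\,T - 0.6\,T = 0.1\,T.
\end{equation*}
Under this non-adaptive policy the simple regret does not vanish and the cumulative regret grows linearly in $T$; a learning algorithm aims to improve on both.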
%which repeatedly takes the optimal action.

\paragraph{Causal Abstraction.}
Finally, we review the notion of abstraction allowing us to relate SCMs. 

\begin{definition}[Abstraction \citep{rischel2020category}]
    Given two SCMs $\scm$ and $\scmi$, an abstraction is a tuple $\abs = \langle \Rset, \amap, \alphamap{} \rangle$, where:
    \begin{itemize}
        \item $\Rset \subseteq \envars$ is a subset of {relevant variables} in $\scm$;
        \item $\amap: \Rset \rightarrow \envars'$ is a surjective map between relevant variables in the base model $\scm$ and variables in the abstracted model $\scmi$;
        \item $\alphamap{X'}: \domain[\amap^{-1}(X')] \rightarrow \domain[X']$, for each $X' \in \envars'$, is a collection of surjective maps from outcomes of the base variables onto outcomes of the abstracted variables.
    \end{itemize}
\end{definition}
%In short, 
An abstraction maps 
variables and realizations in the base SCM $\scm$ onto variables and realizations in the abstracted SCM $\scmi$. Notice that interventions can be immediately transported by abstracting the intervened variables and values. For simplicity, whenever clear from the context, we shorthand the application of $\abs$ as in Tab. \ref{tab:shorthandingalpha}.
In order to assess whether the causal effect of interventions is maintained,
the notion of interventional consistency is introduced:

\begin{definition}[Interventional consistency error \citep{rischel2020category,zennaro2023quantifying}] \label{def:IC_error}
    Given an abstraction $\abs$ from $\scm$ to $\scmi$, let $\mathcal{J}$ be a set of pairs $(\dointv(\mathbf{x}),\mathbf{Y})$ with $(\mathbf{X},\mathbf{Y}) \subseteq \envars^2$ and $\mathbf{X} \cap \mathbf{Y} = \emptyset$. The interventional consistency (IC) error $\abserr$ is defined as the greatest distance between the two paths on the following diagram:
    \begin{center}
    \begin{tikzpicture}[shorten >=1pt, auto, node distance=1cm, thick, scale=1.0, every node/.style={scale=1.0}]
    
    \node[] (M0_0) at (0,0) {$\mathbf{x}$};
    \node[] (M0_1) at (4,0) {$\mathbf{Y}$};
    \node[] (M1_0) at (0,-1.75) {$\mathbf{x'}$};
    \node[] (M1_1) at (4,-1.75) {$\mathbf{Y'}$};
    
    \draw[->]  (M0_0) to node[above,font=\small]{$\prob(\mathbf{Y}\vert \dointv(\mathbf{x}))$} (M0_1);
    \draw[->]  (M1_0) to node[above,font=\small]{$\prob(\mathbf{Y'}\vert \dointv(\mathbf{x'}))$} (M1_1);
    \draw[->]  (M0_0) to node[left,font=\small]{$\alphamap{\mathbf{X'}}$} (M1_0);
    \draw[->]  (M0_1) to node[left,font=\small]{$\alphamap{\mathbf{Y'}}$} (M1_1);
    
    \end{tikzpicture}
    \end{center}
    that is,
    \begin{equation}%\label{eq:abserr}
        \abserr = 
        \max_{(\dointv(\mathbf{x}),\mathbf{Y}) \in \mathcal{J}} \jsd ( \abs( \prob(\mathbf{Y}\vert \dointv(\mathbf{x}))), \prob(\mathbf{Y'}\vert \abs(\dointv(\mathbf{x}))) )
    \end{equation}
    where $\jsd(p,q)$ is the Jensen-Shannon (JS) distance between the distributions $p,q$ (see definition in App. \ref{app:AddDef}).
\end{definition}

 The IC error evaluates the worst-case difference between the distributions computed: (i) by intervening on $\scm$ and then abstracting; or (ii) by abstracting and then intervening on $\scmi$. An \emph{exact abstraction} (w.r.t. the set $\mathcal{J}$) has $\abserr=0$, meaning that interventions and abstractions commute.



\begin{table*}
\begin{centering}
\caption{Equivalence between the CMAB Language and the SCM Formalism.}\label{tab:langageequivalence}
\begin{tabular}{cccccccc}
\hline 
 & Action & Act. set & Reward & Rew. dom. & Rew. distr. & True exp. reward & Est. exp. reward \tabularnewline
\hline 
CMAB & $\action_i$ & $\actionset$ & $\rewardval$ & $\domain[\rewardvar]$ & $\rewarddistr{\action_i}$ 
%& $\expval{\rewarddistr{\action_i}}{\rewardvar}$ 
& $\mu_\action = \expval{\rewarddistr{\action_i}}{\rewardvar}$ 
%& $\estexpval{\rewarddistr{\action_i}}{\rewardvar}
& $\hat{\mu}_\action = \estexpval{\rewarddistr{\action_i}}{\rewardvar}$
\tabularnewline
\hline 
Causal & $\dointv(\mathbf{x}_i)$ & $\intervset$ & $y$ & $\domain[Y]$ & $\prob(Y\vert \dointv(\mathbf{x}_i))$ & $
\mu_{\dointv(\mathbf{x}_i)} = \expval{Y\vert \dointv(\mathbf{x}_i)}{Y} $ & $
\hat{\mu}_{\dointv(\mathbf{x}_i)} = \estexpval{Y\vert \dointv(\mathbf{x}_i)}{Y} $  \tabularnewline
\hline 
\end{tabular}
\par\end{centering}
\end{table*}













\section{CAMABs} \label{sec:CAMAB}

We are now ready to define the CAMAB problem:

\begin{definition}
A causally abstracted MAB (CAMAB) is defined by two CMABs, $\bandit=\langle\scm,\intervset\rangle$ and $\banditi=\langle\scmi,\intervseti\rangle$, and an abstraction $\abs = \langle \Rset,\amap,\alphamap{} \rangle$ from $\scm$ to $\scmi$.
\end{definition}


\begin{table*}
\captionsetup{justification=centerlast}
\caption{Shorthand Notation for the Application of an Abstraction to Variables, Values, Interventions, Distributions, and Abstracted Values.}\label{tab:shorthandingalpha}
\begin{centering}
\begin{tabular}{cccccc}
\hline 
Shorthand & $\abs(X)$ & $\abs(x_i)$ & $\abs(\dointv(x_i))$ & $\abs(\prob(Y))$ & $\abs^{-1}(x_i^\prime)$ \tabularnewline
\hline 
Exact expression & $\amap(X)$ & $\alphamap{\amap(X_i)}(x_i)$ & $\dointv(\amap(X_i) = \alphamap{\amap(X_i)}(x_i))$ & $(\alphamap{\amap(Y)}{}_{\#}\prob)(Y')$ & $\{ x_j \in \domain[X] : \abs(x_j)= x_i^\prime\}$ \tabularnewline
\hline 
\end{tabular}
\par\end{centering}
\end{table*}




We assume that the models agree on the target, $\amap(Y) = Y'$, and that all relevant interventions in the base model can be mapped to the abstracted model.



\begin{example} \label{ex:absCMAB}
    Let $\bandit=\langle \scm, \intervset \rangle$ be a CMAB modelling a generic treatment-mediator-outcome SCM $\scm$, as in Fig. \ref{fig:CMAB-TMY}(left), with $\intervset =\{ \dointv(\texttt{T}=0), \dointv(\texttt{T}=1), \dointv(\texttt{T}=2)\}$ encoding treatments with different dosages. Let $\banditi=\langle \scmi, \intervseti \rangle$ be a simpler CMAB defined on a treatment-outcome SCM $\scmi$, as in Fig. \ref{fig:CMAB-TMY}(right), with $\intervseti =\{ \dointv(\texttt{T'}=0), \dointv(\texttt{T'}=1)\}$. A CAMAB is defined by an abstraction $\abs = \langle \Rset, \amap, \alphamap{} \rangle$ between the two SCMs $\scm$ and $\scmi$, such as:
    \begin{itemize}
        \item $\Rset = \{\texttt{T},\texttt{Y}\}$;
        \item $\amap(\texttt{T}) = \texttt{T}', \amap(\texttt{Y}) = \texttt{Y'}$;
        \item $\alphamap{\texttt{T}'}(0) = 0$, $\alphamap{\texttt{T}'}(1) = 1$, $\alphamap{\texttt{T}'}(2) = 1$, and $\alphamap{\texttt{Y}'}$ is the identity map.
    \end{itemize}
\end{example}
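
To make the transport of interventions in Example \ref{ex:absCMAB} concrete, the following small worked illustration applies the shorthand of Tab. \ref{tab:shorthandingalpha} to the three base actions:
\begin{align*}
    \abs(\dointv(\texttt{T}=0)) &= \dointv(\texttt{T}'=\alphamap{\texttt{T}'}(0)) = \dointv(\texttt{T}'=0), \\
    \abs(\dointv(\texttt{T}=1)) &= \dointv(\texttt{T}'=\alphamap{\texttt{T}'}(1)) = \dointv(\texttt{T}'=1), \\
    \abs(\dointv(\texttt{T}=2)) &= \dointv(\texttt{T}'=\alphamap{\texttt{T}'}(2)) = \dointv(\texttt{T}'=1).
\end{align*}
The two non-zero dosages are thus merged into a single abstracted treatment, and every action in $\intervset$ has an image in $\intervseti$.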

\fmz{A CAMAB thus provides a formal way to relate two CMABs defined on different variables with incompatible domains, and offers a formulation to express the problem of transporting information across heterogeneous models. CMABs can then be related to each other through an abstraction, and their data can be aggregated to solve both CMABs jointly and more efficiently.}

\begin{figure}
    \centering
    \begin{tikzpicture}[shorten >=1pt, auto, node distance=1cm, thick, scale=0.8, every node/.style={scale=0.8}]		
			\node[blacknode] (X) at (0,0) {$\texttt{T}$};
			\node[blacknode] (Y) at (1.5,0) {$\texttt{M}$};
                \node[blacknode] (Z) at (3,0) {$\texttt{Y}$};
			\draw[->]  (X) to (Y) ;
			\draw[->]  (Y) to (Z);

                %\node[blacknode,scale=.7,blue] (aX) at (6,1.5) {$\abs(T)$};
                %\node[blacknode,scale=.7,blue] (aY) at (7.5,1.5) {$\abs(Y)$};
			%\draw[->, blue]  (aX) to (aY);

                \node[blacknode] (Xi) at (7,0) {$\texttt{T'}$};
                \node[blacknode] (Yi) at (8.5,0) {$\texttt{Y'}$};
			\draw[->]  (Xi) to (Yi);

                \draw[->, bend left, dashed]  (X) to node[above,font=\small]{$\abs$} (Xi);
                \draw[->, bend left=45, dashed]  (Z) to (Yi);
		\end{tikzpicture}
    \caption{Base model $\scm$ (left) and abstracted model $\scmi$ (right)}\label{fig:CMAB-TMY}
    
 
\end{figure}






\section{Measuring error in CAMABs} \label{sec:Measuring}
\fmz{An abstraction in a CAMAB may introduce approximations.} In order to estimate the discrepancy between the CMABs in a CAMAB, we rely on two measures. First, we introduce a variant of \emph{IC error} to quantify the gap between the pushforward of the base model via $\abs$ and the true abstracted model. As we focus on the reward variable $Y$, we reduce the set $\mathcal{J}$ in Def. \ref{def:IC_error} to the set $\mathcal{I}$ of pairs of the form $(\dointv(\mathbf{x}),Y)$ corresponding to relevant interventions. Moreover, we substitute the JS distance $\jsd$ with the Wasserstein 2-distance $\dwassII$ (see definition in App. \ref{app:AddDef}); both distances have the {1-Lipschitz} property that guarantees a bound on the error in the composition of abstractions \citep{rischel2020category}, but $\dwassII(p,q)$ also allows us to bound the distance between the expected values of the distributions $p$ and $q$. We then redefine the IC error as:
\begin{equation}
\abserr =\max_{\dointv(\mathbf{x}) \in \mathcal{I}} \dwassII ( \abs(\prob({Y}\vert \dointv(\mathbf{x}))), \prob({Y'}\vert \abs(\dointv(\mathbf{x})))).
\end{equation}
This error quantifies the worst-case distributional distance between the reward distributions computed by (i) running action $\dointv(\mathbf{x})$ in the base CMAB and pushing forward the resulting distribution via $\alphamap{Y'}$ to the abstracted CMAB, and (ii) abstracting the base action to $\abs(\dointv(\mathbf{x}))$ and computing the reward distribution in the abstracted CMAB.
A {zero IC error} in a CAMAB guarantees that the abstraction of the distribution of the outcome under the base action $\action$ is the same as the distribution of the outcome under the abstracted action $\abs(\action)$, that is,
$
    \abs(\prob(Y \vert \dointv(\mathbf{x}))) = \prob(Y' \vert \abs(\dointv(\mathbf{x}))).
$
  
We also introduce a second measure of \emph{reward discrepancy} to quantify the gap between the reward distribution in the base model and the reward distribution in the pushforward of the base model:
\begin{equation}
\rdiscr = \max_{\dointv(\mathbf{x}) \in \mathcal{I}} \dwassII (\prob({Y}\vert \dointv(\mathbf{x})), \abs( \prob({Y}\vert \dointv(\mathbf{x})))),
\end{equation}
where the distributions are defined on $[0,1]$. This error quantifies the worst-case distributional distance between the reward distributions computed by (i) running action $\dointv(\mathbf{x})$ in the base model, and (ii) running the same action $\dointv(\mathbf{x})$ and pushing forward the resulting distribution via $\alphamap{Y'}$ in the abstracted CMAB.
 A \emph{zero discrepancy} guarantees that the distribution of the rewards is identical in the base and under the pushforward. 
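
As a simple sanity check, consider the abstraction of Example \ref{ex:absCMAB}, where $\alphamap{\texttt{Y}'}$ is the identity map. The pushforward then leaves every reward distribution unchanged, so that, writing $p_{\mathbf{x}} = \prob(Y \vert \dointv(\mathbf{x}))$ (a notation used only for this illustration),
\begin{equation*}
    \rdiscr = \max_{\dointv(\mathbf{x}) \in \mathcal{I}} \dwassII ( p_{\mathbf{x}}, \abs(p_{\mathbf{x}}) ) = \max_{\dointv(\mathbf{x}) \in \mathcal{I}} \dwassII ( p_{\mathbf{x}}, p_{\mathbf{x}} ) = 0.
\end{equation*}
In that case any gap between the expected rewards of the two CMABs can only come from the IC error.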

 These two measures immediately allow us to bound the difference in expected values (see App. \ref{app:Proofs} for the formal proof): 

\begin{proposition}[Bound on difference of expected rewards]\label{prop:exprewardsbound}
    Given a CAMAB, the difference in expected rewards $|\mu_{\action_i} - \mu'_{\abs(\action_i)}|$ is bounded by $\abserr + \rdiscr$.
\end{proposition}
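
The following is only an informal sketch of why such a bound is plausible, and is not a substitute for the formal proof in App. \ref{app:Proofs}. It relies on the standard fact that the absolute difference between the means of two distributions is at most their Wasserstein 1-distance, which in turn is at most their Wasserstein 2-distance. For a fixed action $\action_i = \dointv(\mathbf{x})$, write $p = \prob(Y \vert \dointv(\mathbf{x}))$, $q = \abs(p)$ and $p' = \prob(Y' \vert \abs(\dointv(\mathbf{x})))$, and let $\operatorname{mean}(\cdot)$ denote the expected value of a distribution (these symbols are introduced only for this sketch). Then
\begin{align*}
    |\mu_{\action_i} - \mu'_{\abs(\action_i)}|
    &= |\operatorname{mean}(p) - \operatorname{mean}(p')| \\
    &\leq |\operatorname{mean}(p) - \operatorname{mean}(q)| + |\operatorname{mean}(q) - \operatorname{mean}(p')| \\
    &\leq \dwassII(p,q) + \dwassII(q,p') \\
    &\leq \rdiscr + \abserr,
\end{align*}
where the second line is the triangle inequality and the last line takes the maximum over the relevant interventions in $\mathcal{I}$.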

\section{Transferring information in CAMABs} \label{sec:taxonomy}
In our general CAMAB formulation, transfer may be characterized along multiple dimensions, such as:
\begin{itemize}
    \item \emph{Quantities to abstract:} whether transferring individual quantities (e.g., actions, rewards) or model-based statistical quantities (e.g., expectations);
    \item \emph{Synchronicity of abstraction:} whether transferring information online (solving the CMABs synchronously) or offline (solving the CMABs asynchronously);
    \item \emph{Direction of abstraction:} whether transferring information from base to abstracted model, or vice versa.
\end{itemize}

We consider a series of three representative scenarios in order of increasing complexity, covering meaningful approaches to the solution of a CAMAB. 
%leaving a detailed exhaustive investigation of all the possibilities to future work. 
%We start from a simple setup with an intuitive solution, and build towards more challenging and realistic protocols. 
\fmz{We focus, in particular, on the dimension of the \emph{quantity to abstract}, assuming an offline transfer of information from the base to the abstracted model.}
For each scenario, we propose algorithmic solutions, analyze their behaviour in terms of regret, and provide examples; we refer to App. \ref{app:Proofs} for formal proofs, App. \ref{app:Algorithms} for pseudo-code of our algorithms, App. \ref{app:Experiments} for details about our examples, and to the online repository\footnote{\url{https://github.com/FMZennaro/causally-abstracted-multiarmed-bandits}} for the simulation code.

\begin{figure*}
    \centering
	\begin{subfigure}{.35\textwidth}
		\begin{tikzpicture}[shorten >=1pt, auto, node distance=1cm, thick, scale=.8, every node/.style={scale=.8}]
    
            \node[] (M0_0) at (0,0) {$\mathbf{x^*}$};
            \node[] (M0_1) at (3,0) {${Y}$};
            \node[] (EM0) at (5.5,0) {$\expval{{Y}\vert \dointv(\mathbf{x})}{Y}$};
            \node[] (M1_0) at (0,-2) {$\alphamap{\mathbf{X'}}(\mathbf{x^*})$};
            \node[] (M1_1) at (3,-2) {${Y'}$};
            \node[] (EM1) at (5.5,-2) {$\expval{{Y'}\vert \dointv(\mathbf{x'})}{Y'}$};
            
            \draw[->]  (M0_0) to node[below,font=\small]{$\prob({Y}\vert \dointv(\mathbf{x}))$} (M0_1);
            \draw[->]  (M1_0) to node[above,font=\small]{$\prob({Y'}\vert \dointv(\mathbf{x'}))$} (M1_1);
            \draw[->,blue]  (M0_0) to node[left,font=\small]{$\alphamap{\mathbf{X'}}$} (M1_0);
            \draw[->]  (M0_1) to node[left,font=\small]{$\alphamap{{Y'}}$} (M1_1);
            \draw[->]  (M0_1) to node[below,font=\small]{$\expvalop{}$} (EM0);
            \draw[->]  (M1_1) to node[above,font=\small]{$\expvalop{}$} (EM1);
            \draw[->,dashed]  (EM0) to (EM1);
            
            \end{tikzpicture}
            \caption{}
		  \label{fig:besttransfer}
	\end{subfigure}
	\begin{subfigure}{.32\textwidth}
		\centering
    		\begin{tikzpicture}[shorten >=1pt, auto, node distance=1cm, thick, scale=.8, every node/.style={scale=.8}]
    
            \node[] (M0_0) at (1,0) {$\mathbf{x}$};
            \node[] (M0_1) at (3,0) {${Y}$};
            \node[] (EM0) at (5.5,0) {$\expval{{Y}\vert \dointv(\mathbf{x})}{Y}$};
            \node[] (M1_0) at (1,-2) {$\mathbf{x'}$};
            \node[] (M1_1) at (3,-2) {${Y'}$};
            \node[] (EM1) at (5.5,-2) {$\expval{{Y'}\vert \dointv(\mathbf{x'})}{Y'}$};
            
            %\draw[->,ao]  (M0_0) to node[below,font=\small]{$\prob(\mathbf{Y}\vert \dointv(\mathbf{x}))$} (M0_1);
            \draw[->,ao]  (M0_0) to node[below,font=\small]{} (M0_1);
            %\draw[->,blue]  (M1_0) to node[above,font=\small]{$\prob({Y'}\vert \dointv(\mathbf{x'}))$} (M1_1);
            \draw[->,blue]  (M1_0) to node[above,font=\small]{} (M1_1);
            \draw[->,blue]  (M0_0) to node[left,font=\small]{$\alphamap{\mathbf{X'}}$} (M1_0);
            \draw[->,ao]  (M0_1) to node[left,font=\small]{$\alphamap{{Y'}}$} (M1_1);
            \draw[->]  (M0_1) to node[below,font=\small]{} (EM0);
            \draw[->,blue]  (M1_1) to node[above,font=\small]{} (EM1);
            \draw[->,dashed]  (EM0) to (EM1);
            \draw[->,ao]  (M1_1.-20) to node[below,font=\small]{} (EM1.185);
            
            \end{tikzpicture}
		\caption{}
		\label{fig:imitation}
	\end{subfigure}
        \begin{subfigure}{.32\textwidth}
		\centering
    		\begin{tikzpicture}[shorten >=1pt, auto, node distance=1cm, thick, scale=.8, every node/.style={scale=.8}]
    
            \node[] (M0_0) at (1,0) {$\mathbf{x}$};
            \node[] (M0_1) at (3,0) {${Y}$};
            \node[] (EM0) at (5.5,0) {$\expval{{Y}\vert \dointv(\mathbf{x})}{Y}$};
            \node[] (M1_0) at (1,-2) {$\mathbf{x'}$};
            \node[] (M1_1) at (3,-2) {${Y'}$};
            \node[] (EM1) at (5.5,-2) {$\expval{{Y'}\vert \dointv(\mathbf{x'})}{Y'}$};

            % Q(a)
            % \draw[->,blue]  (M0_0) to node[below,font=\small]{} (M0_1);
            % \draw[->,blue]  (M0_1) to node[below,font=\small]{{\color{black}$\expval_Y$}} (EM0);
            \draw[->,blue]  (M0_0) to node[below,font=\small]{} (M0_1);
            \draw[->,blue]  (M0_1) to node[below,font=\small]{} (EM0);
            
            % Q(a')
            % \draw[->,orange]  (M1_0) to node[above,font=\small]{} (M1_1);
            % \draw[->,orange]  (M1_1) to node[above,font=\small]{} (EM1);
            \draw[->]  (M1_0) to node[above,font=\small]{} (M1_1);
            \draw[->]  (M1_1) to node[above,font=\small]{} (EM1);

            %Q(alpha(a))
            % \draw[->,ao]  (M0_0.-30) to node[below,font=\small]{} (M0_1.210);
            % \draw[->,ao]  (M0_1) to node[left,font=\small]{{\color{black}$\alphamap{{Y'}}$}} (M1_1);
            % \draw[->,ao]  (M1_1.-30) to node[below,font=\small]{} (EM1.200);
            \draw[->]  (M0_1) to node[left,font=\small]{{\color{black}$\alphamap{{Y'}}$}} (M1_1);

            %
            % \draw[->,red]  (M0_0.30) to node[below,font=\small]{} (M0_1.150);
            % \draw[->,red]  (M0_1.30) to node[below,font=\small]{} (EM0.160);
            \draw[->,blue,dashed]  (EM0) to node[right,font=\small]{$\alphamap{{\expvalop{}}}$}(EM1);

            %
            % \draw[->,yellow]  (M0_0) to node[left,font=\small]{{\color{black}$\alphamap{\mathbf{X'}}$}} (M1_0);
            % \draw[->,yellow]  (M1_0.30) to node[above,font=\small]{} (M1_1.158);
            % \draw[->,yellow]  (M1_1.30) to node[above,font=\small]{{\color{black}$\expval_{Y'}$}} (EM1.160);
            \draw[->]  (M0_0) to node[left,font=\small]{{\color{black}$\alphamap{\mathbf{X'}}$}} (M1_0);
            
            
            \end{tikzpicture}
		\caption{}
		\label{fig:Qtransfer}
	\end{subfigure}
	%\vspace{-2\baselineskip}
	\caption{Diagrams illustrating the (a) \transferoptimum, (b) \imitation, (c) \transferexpect{} algorithms}\label{fig:protocols}
	%\vspace{-1\baselineskip}
\end{figure*}

\subsection{Transfer of optimal action} \label{ssec:Scenario1}

Let us consider a CAMAB where the base CMAB $\bandit$ was solved using a standard CMAB algorithm (see Alg. \ref{alg:direct}), so that at timestep $T$ an optimal action $\action_o$ has been identified.
As abstraction learning algorithms often work by minimizing the IC error \citep{zennaro2023jointly,felekis2023causal}, it may seem sensible to expect that an exact abstraction would allow us to solve the CMAB problem $\banditi$ simply by transferring the optimal action. We could then consider the following intuitive protocol: we use the abstraction $\abs$ to transport the optimal action in $\bandit$ to $\banditi$, and keep choosing action $\abs(\action_o)$. 
This defines a new \emph{transfer-optimum} (\transferoptimum) algorithm, as in Alg. \ref{alg:transfer-optimum}.
This protocol is very efficient, requiring only one computation for transferring $\action_o$. However, we now show a negative result:
\begin{proposition}[Biasedness of \transferoptimum]\label{prop:transferoptimum_biasedness}
    Assuming an exact abstraction $\abs$ and the optimality of the learned action $\action_o = \optaction$, it is not guaranteed that $\abs(\action_o) = \optactioni$.
\end{proposition}

Although counterintuitive at first, this proposition can be proved through counterexamples as follows.

\begin{example}\label{ex:counterexample1}
    Consider the CAMAB in Fig. \ref{fig:CMAB-TMY} where $\scm$ is defined on binary variables. As long as the base composed mechanism $f_\texttt{Y}(f_\texttt{M}())$ is symmetric, and $f_{\texttt{Y}'}() = f_\texttt{Y}(f_\texttt{M}())$, a zero IC error may be achieved by taking both $\alphamap{\texttt{T}'}$ and $\alphamap{\texttt{Y}'}$ as anti-diagonal matrices. Then $\abserr=0$, but $\abs(\optaction) \neq \optactioni$. See Fig. \ref{fig:counterexample2_2} in the Appendix for a concrete illustration.
\end{example}

The symmetry in the mechanisms allows us to swap the labels of interventions and outcomes, while preserving consistency. This limit case may be unlikely in real scenarios, where causal asymmetries rule out such a possibility. Moreover, since in a MAB the domain $\domain[Y]$ has an implicit ordering, we can prevent this by requiring $\alphamap{Y'}$ to be \emph{order-preserving}. There is however another counterexample.

\begin{example} \label{ex:counterexample2}
    Let $\bandit$ and $\banditi$ be CMABs with the same underlying DAG in Fig. \ref{fig:CMAB-TMY}(right). Let us define an order-preserving zero-error abstraction $\abs$ between them. If the domains of the outcomes are allowed to differ, $\domain[\texttt{Y}] \neq \domain[\texttt{Y}']$, we may be able to choose $\domain[\texttt{Y}']$ so that $\sum_{y' \in \domain[\texttt{Y}']} y' \abs(\prob(\texttt{Y}\vert \dointv(\optaction))) < \sum_{y' \in \domain[\texttt{Y}']} y' \prob(\texttt{Y}'\vert \dointv(\optactioni))$, thus implying that $\abs(\optaction) \neq \optactioni$. See Fig. \ref{fig:counterexample1} for a concrete illustration.
\end{example}

Even with an exact abstraction $\abs$ and an order-preserving map $\alphamap{Y'}$, the weighting of the interventional distributions by the different values in $\domain[Y]$ and $\domain[Y']$ can lead to different expected values. 
In general, these examples make sense as an exact abstraction guarantees that commuting interventions and abstraction produces the same distribution, but it does not guarantee that maxima in the base distribution will necessarily be mapped onto maxima in the abstracted model, that is:
\begin{align}
    \abs(\prob(Y \vert \dointv(\mathbf{x}))) &= \prob(Y' \vert \abs(\dointv(\mathbf{x}))) \not\Rightarrow \label{eq:identitydistr} \\
     \abs \left(\argmax_{\dointv(\mathbf{x}) \in \intervset} \mu_{\dointv(\mathbf{x})} \right) &=  \argmax_{\dointv(\mathbf{x'}) \in \intervseti} \mu'_{\dointv(\mathbf{x'})}  \label{eq:identitymaxes}
\end{align}

Here, we are not interested in the commutativity of the diagram of Def. \ref{def:IC_error}, but in the outer path of a diagram as in Fig. \ref{fig:besttransfer}; \transferoptimum{} relies only on the map $\alphamap{\mathbf{X'}}$ (blue arrow), improperly assuming the maxima of the expected values to be aligned (dashed arrow). However, the commutativity of the leftmost square encoded by Eq. \ref{eq:identitydistr} does not imply the commutativity of the outer square required by Eq. \ref{eq:identitymaxes}. 
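
To make this failure concrete, consider the following hypothetical instance, where all numbers are chosen purely for illustration. Let $\domain[Y]=\{0,1,2\}$ and $\domain[Y']=\{0,1,10\}$, let $\alphamap{Y'}$ be the order-preserving map $0 \mapsto 0$, $1 \mapsto 1$, $2 \mapsto 10$, and let the abstraction act as the identity on two interventions $\dointv(\mathbf{x}_1)$ and $\dointv(\mathbf{x}_2)$. Assume $\prob(Y \vert \dointv(\mathbf{x}_1))$ places probabilities $(0,1,0)$ and $\prob(Y \vert \dointv(\mathbf{x}_2))$ places probabilities $(0.6,0,0.4)$ on the values $(0,1,2)$, and that the abstraction is exact, so that the abstracted distributions place the same probabilities on $(0,1,10)$. Then:
\begin{align*}
    \mu_{\dointv(\mathbf{x}_1)} &= 0 \cdot 0 + 1 \cdot 1 + 2 \cdot 0 = 1, \\
    \mu_{\dointv(\mathbf{x}_2)} &= 0 \cdot 0.6 + 1 \cdot 0 + 2 \cdot 0.4 = 0.8, \\
    \mu'_{\dointv(\mathbf{x}'_1)} &= 0 \cdot 0 + 1 \cdot 1 + 10 \cdot 0 = 1, \\
    \mu'_{\dointv(\mathbf{x}'_2)} &= 0 \cdot 0.6 + 1 \cdot 0 + 10 \cdot 0.4 = 4.
\end{align*}
In this sketch the base optimum is $\dointv(\mathbf{x}_1)$, while the abstracted optimum is $\dointv(\mathbf{x}'_2)$: Eq. \ref{eq:identitydistr} holds, yet Eq. \ref{eq:identitymaxes} fails.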







Prop. \ref{prop:exprewardsbound} allows us to define a sufficient condition under which the maximum is preserved, as requested by Eq. \ref{eq:identitymaxes}.
\begin{lemma}[Sufficient condition for preservation of maximum]\label{lem:transferoptimum_suffcondition}
    Given a CAMAB, the maximum is preserved by the abstraction $\abs$ if $\abserr+\rdiscr \leq \frac{1}{2} \min_{\action \in \actionset : \gap{\action}>0} \gap{\action}$.
\end{lemma}
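
As a hypothetical numerical reading of this condition (the values are chosen only for illustration), suppose the base CMAB has two suboptimal actions with gaps $\gap{\action_1}=0.2$ and $\gap{\action_2}=0.3$. The right-hand side of the condition is then
\begin{equation*}
    \frac{1}{2} \min\{0.2, 0.3\} = 0.1,
\end{equation*}
so an abstraction with, say, $\abserr = 0.03$ and $\rdiscr = 0.05$ satisfies the condition, since $0.03 + 0.05 = 0.08 \leq 0.1$. With $\rdiscr = 0.1$ instead, the sum $0.13$ exceeds the threshold; as the condition is only sufficient, the lemma then gives no guarantee either way.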

An alternative algebraic formulation of a sufficient condition is provided in Lemma \ref{lem:transferoptimum_algsuffcondition} in the Appendix. 
Preservation of maxima immediately determines the asymptotic simple regret $\sregri_\algto(T)$ of the abstracted CMAB:

\begin{proposition}[Asymptotic regret of \transferoptimum]\label{prop:transferoptimum_simpleregret}
    For $T\rightarrow\infty$, using \transferoptimum, $\sregri_\algto(T) \rightarrow 0$ iff $\abs(\optaction) = \optactioni$.
\end{proposition}
Although in finite time the algorithm on the base CMAB may wrongly hold as optimal an action $\action_o$ which happens to map to the optimal action, that is $\abs(\action_o)=\optactioni$, in the asymptotic regime maximum preservation is required to achieve zero simple regret in the abstracted CMAB.

\begin{example}\label{ex:scenario1}
    Let us consider two CAMABs like the one in Ex. \ref{ex:counterexample1}, both with an exact abstraction, but not necessarily maximum-preserving. Fig. \ref{fig:simul1} shows that, in case of maximum preservation (blue), the simple regrets of \ucb{} (dashed) and $\transferoptimum{}$ (solid) both converge to $0$ as a function of the timesteps $T$; instead, in case of no maximum preservation (green), the simple regret of $\transferoptimum{}$ (solid) does not converge to $0$, unlike that of \ucb{} (dashed). The results agree with the statement of Prop. \ref{prop:transferoptimum_simpleregret}.
\end{example}

Although appealing, \transferoptimum{} can be expected to perform satisfactorily only under the stringent condition that the maximum is preserved; otherwise, the abstracted CMAB will incur linear cumulative regret. 

\begin{algorithm}
	\caption{\transferoptimum{} Algorithm}\label{alg:transfer-optimum}
	\begin{algorithmic}[1]
		\STATE { \textbf{Input:} } CMAB $\bandit$, estimated optimal action $\action_o$ in the base model, abstraction $\abs$
		\STATE { \textbf{Output:} } optimal policy $\policy$
		
		\STATE Set $\policy(\abs(\action_o)) = 1$
		
		\STATE { \textbf{Return:} } $\policy$
		
	\end{algorithmic}
\end{algorithm}

\subsection{Transfer of actions} \label{ssec:Scenario2}

\begin{algorithm*}
	\caption{\imitation{} Algorithm}\label{alg:imitation}
	\begin{algorithmic}[1]
		\STATE { \textbf{Input:} } CMAB $\bandit$, set of trajectories $\trajectoryset=\{(\actiont{t},\rewardvalt{t})\}_{t=1}^T$ from base model, abstraction $\abs$
		\STATE { \textbf{Output:} } optimal policy $\policy$
		
		\STATE Initialize expected rewards $\estexpvalt{Y\vert \action_i}{Y}{0}$, auxiliary statistics $\suppstatst{0}$ \COMMENT{Set up the parameters}
		\FOR{$t = 1, \ldots, T$ }{
			
			\STATE Select $a^{(t)} \leftarrow \abs(\mathcal{D}[0,t])$ \COMMENT{Action-translation}
			
			\STATE Receive $\rewardvalt{t} \distributes \rewarddistr{\actiont{t}}$ \COMMENT{Reward-collection}
			
			\STATE Compute $\estexpvalt{Y\vert \actiont{t}}{Y}{t},\suppstatst{t} \leftarrow \operatorname{update}\left(\estexpvalt{Y\vert \actiont{t}}{Y}{t-1}, \suppstatst{t-1},\actiont{t},\rewardvalt{t}\right)$ \COMMENT{Update stats}
		} 
		\ENDFOR

        \STATE Compute $\policyt{T} \leftarrow \alg\left(\estexpvalt{Y\vert \actiont{t}}{Y}{T}, \suppstatst{T}\right)$ \COMMENT{Evaluate policy}
		
		\STATE { \textbf{Return:} } $\policyt{T}$
		
	\end{algorithmic}
\end{algorithm*}

Instead of transferring the optimal action, we may consider an algorithm inspired by \emph{off-policy reinforcement learning} (RL) where, at each timestep $t$, the abstracted agent translates the outcome of the decision-making of the base agent and takes action $\abs(\actiont{t})$. This process is illustrated by the blue arrows in Fig. \ref{fig:imitation} and codifies an \emph{imitation} algorithm (\imitation), as in Alg. \ref{alg:imitation}. In RL terms, the abstracted agent uses the \emph{behaviour policy} $\abs(\policyt{t})$ to train its \emph{target policy} $\policyit{t}$ \citep{sutton2018reinforcement}. 
Unfortunately, the IC error itself is not informative regarding the distance between $\abs(\policyt{t})$ and $\policyit{t}$,
as the commutativity implied by zero IC error does not relate to the distribution of actions:
\begin{eqnarray}
    \abs(\prob(Y \vert \dointv(\mathbf{x}))) &=& \prob(Y' \vert \abs(\dointv(\mathbf{x}))) \not\Rightarrow \\
     \prob(\dointv(\mathbf{x'})) &=& \prob(\abs(\dointv(\mathbf{x}))).
\end{eqnarray}



Still, as actions are run in the abstracted CMAB, it follows:
 \begin{lemma}[Unbiasedness of \imitation] \label{lem:imitation_unbiasedness}
     Under the coverage assumption that $\policyt{t}$ has non-zero probability for every action $\actiont{t}$, the estimates $\hat{\mu}_{\dointv(\mathbf{x})}$ of \imitation{} are unbiased.
 \end{lemma}





Granted the coverage assumption, it holds that the expected rewards learned directly on the abstracted CMAB or learned on the abstracted CMAB by translating the actions of the base CMAB will be equal.
However, CMAB algorithms will not always satisfy the coverage assumption; instead, algorithms like \ucb{} will learn to choose the optimal action $\optaction$. For a finite $T$, the confidence in the estimate of the expected rewards will depend on the number of times each action $\actioni$ is tested under \imitation. Let $\mathcal{K}(\actioni_{i}) = |\abs^{-1}(\actioni_i)|$ be the size of the set of base actions mapping to $\actioni_{i}$; then, we can derive the following result in relation to \ucb{}:
\begin{proposition}[Confidence of \imitation] \label{prop:imitation_confidence}
    Given a CAMAB,
    assume we have run \ucb{} for $T$ steps on $\scm$. For \imitation{} to reach the same level of confidence in $\hat{\mu}_{\actioni_i}$ 
    %its estimates of the reward of action $\actioni_i$ 
    as running \ucb{} for $T$ steps on $\scmi$, it must hold that $N(\mathcal{K}(\actioni_i) -1) + \left( \sum_{\action_{j} \in \abs^{-1}(\actioni_i)}\frac{1}{\gap{\action_j}^2} - \frac{1}{\gap{\actioni_i}^2} \right) \geq  0$, where $N>0$ is a constant term.
\end{proposition}

The first term of the inequality accounts for the constant number of times each action has to be sampled; in the abstracted CMAB, an action $\actioni_i$ aggregates the constant component from all the base actions in $\abs^{-1}(\actioni_i)$. The second term of the inequality accounts for the additional number of times each action has to be sampled according to its optimality gap $\gap{\actioni_i}$; again, in the abstracted CMAB, an action $\actioni_i$ aggregates the gaps of all the base actions $\action_j \in \abs^{-1}(\actioni_i)$. 
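
As a hypothetical numerical check of the inequality in Prop. \ref{prop:imitation_confidence} (the gaps are chosen only for illustration), suppose two base actions with gaps $\gap{\action_1}=0.2$ and $\gap{\action_2}=0.4$ map onto an abstracted action $\actioni_i$ with gap $\gap{\actioni_i}=0.25$, so that $\mathcal{K}(\actioni_i)=2$. Then
\begin{equation*}
    N(2-1) + \left( \frac{1}{0.2^2} + \frac{1}{0.4^2} - \frac{1}{0.25^2} \right) = N + (25 + 6.25 - 16) = N + 15.25 \geq 0,
\end{equation*}
and the condition holds for any $N>0$. Conversely, if a single base action with gap $0.5$ maps onto an abstracted action with gap $0.25$, the left-hand side is $N \cdot 0 + (4 - 16) = -12 < 0$, and the condition fails.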

A similar reasoning can be followed to discuss the regret when running \imitation{} against running another algorithm directly on the abstracted CMAB:
 \begin{lemma}[Cumulative regret for \imitation] \label{lem:imitation_cumregret}
    Given a CAMAB, assume we have run $\alg$ for $T$ steps on $\scm$.
     The difference in cumulative regret between running $\alg$ or \imitation{} on $\scmi$ is $\expval{\alg}  
{\sum_{t=0}^T 
\sum_{\actioni_i \in \actionseti} 
{\abs(
\prob(\actionit{t}=\actioni_i)) 
\mu_{\actioni_i}}} - \expval{\alg} {\sum_{t=0}^T \sum_{\actioni_i \in \actionseti} \prob(\actionit{t}=\actioni_i) \mu_{\actioni_i}}$.
 \end{lemma}
The difference in cumulative regret is dependent on the weighting of the policies $\abs(\policy)$ and $\policyi$. More concretely, if we take $\alg$ to be \ucb{} we obtain:

\begin{proposition}[Regret lower bound of \imitation]\label{prop:imitation_regretlowerbound}
    Given a CAMAB, assume we have run \ucb{} for $T$ steps on $\scm$. For \imitation{} to have a lower regret bound than running \ucb{} for $T$ steps on $\scmi$, it must hold that $3\sum_{a_{i}'\in\mathcal{A}'}\Delta(a_{i}')\left[1-\mathcal{K}(a'_{i})\right]+16\log T\sum_{a_{i}'\in\mathcal{A}'}\left[\frac{1}{\Delta(a_{i}')}-\Delta(a_{i}')\frac{1}{\sum_{a_{j} \in \abs^{-1}(\actioni_i)}\Delta^{2}(a_{j})}\right] \geq 0$.
\end{proposition}


\begin{algorithm*}
	\caption{\transferexpect{} Algorithm}\label{alg:qtransfer}
	\begin{algorithmic}[1]
		\STATE { \textbf{Input:} } CMAB $\bandit$, estimate rewards $\estexpvalt{Y\vert \action_i}{Y}{t}$ from the base model at timestep $t$, abstraction $\abs$, time horizon $T$
		\STATE { \textbf{Output:} } optimal policy $\policy$
		
		\STATE Initialize expected rewards $\estexpvalt{Y'\vert \actioni_i}{Y'}{0}$ by abstracting $\estexpvalt{Y\vert \action_i}{Y}{t}$ \COMMENT{Expected value-translation}
        \STATE Reduce the action set $\intervseti$ to the set of optimistic interventions $\optintervseti$   \COMMENT{Action sub-selection}
            \STATE Initialize auxiliary statistics $\mathcal{S}^{(0)}$ and policy $\policyt{0}$ \COMMENT{Set up the parameters}
		\FOR{$t = 1, \ldots, T$ }{
			
			\STATE Select $\actionit{t} \distributes \policyt{t-1}$ \COMMENT{Decision-making}
			
			\STATE Receive $\rewardvalt{t} \distributes \rewarddistr{\actionit{t}}$ \COMMENT{Reward-collection}
			
			\STATE Compute $\estexpvalt{Y'\vert \actionit{t}}{Y'}{t},\suppstatst{t} \leftarrow \operatorname{update}\left(\estexpvalt{Y'\vert \actionit{t}}{Y'}{t-1}, \suppstatst{t-1},\actionit{t},\rewardvalt{t}\right)$ \COMMENT{Update stats}
		
                \STATE Compute $\policyt{t} \leftarrow \alg\left(\estexpvalt{Y'\vert \actionit{t}}{Y'}{t}, \suppstatst{t}\right)$ \COMMENT{Update policy}
		} 
		\ENDFOR
		
		\STATE { \textbf{Return:} } $\policyt{T}$
		
	\end{algorithmic}
\end{algorithm*}




Notice how the confidence in Prop. \ref{prop:imitation_confidence} and the lower bound in Prop. \ref{prop:imitation_regretlowerbound} are in a trade-off: if many actions $\action_j$ with small gaps $\gap{\action_j}$ map onto $\actioni_i$, then \imitation{} will oversample $\actioni_i$ and be overconfident in its estimate; however, because of this oversampling, its cumulative regret will be greater than that of just running \ucb{}. \imitation{} thus performs best when the optimal action $\optaction$, together with other actions $\action_j$ with small gaps $\gap{\action_j}$, is mapped onto $\optactioni$, as the oversampling will then not contribute to the regret. This agrees with the intuition that a good abstraction clusters together the optimal action with actions providing a close reward. This dynamic is confirmed in the asymptotic regime:

\begin{proposition}[Asymptotic regret for \imitation] \label{prop:imitation_asymptoticregret}
    For $T\rightarrow\infty$, the abstracted CMAB using \imitation{} achieves sub-linear cumulative regret iff $\abs(\optaction) = \optactioni$.
\end{proposition}




\begin{example}\label{ex:scenario2}
    Consider two CAMABs defined on the same CMABs, the first one with a non-maximum-preserving abstraction $\abs_1$ and the second with an abstraction $\abs_2$ that aggregates the maximum and another slightly suboptimal action; see the illustrations in Fig. \ref{fig:scenario5a} and \ref{fig:scenario5b}, respectively. Fig. \ref{fig:simul2} shows that the difference between the regret of running \ucb{} directly on the abstracted model and that of \imitation{} can be either negative (for $\abs_1$ in blue) or positive (for $\abs_2$ in red), as explained by Prop. \ref{prop:imitation_regretlowerbound}.
    
    %On the other hand, if we consider the standard CAMAB of Ex. \ref{ex:counterexample1} with identities for $\alphamap{T'}$ and $\alphamap{Y'}$, then Fig. \ref{fig:scenario2_1b2} shows that in case of zero-IC-error abstraction the cumulative regret of the \imitation{} protocol is close to the cumulative regret of the UCB algorithm. 
    %In general, however, if we consider a scenario with more actions as in Ex. \ref{ex:absCMAB} and two different abstraction then, consistently with Prop. \ref{lem:imitation_cumregret}, the difference between the cumulative regret of UCB and the \imitation{} algorithm may be positive or negative, as shown in Fig. \ref{fig:scenario2_3}.
\end{example}
  
In finite time, \imitation{} has less strict conditions for success than \transferoptimum; however, assessing these conditions is not trivial, and \imitation{} still requires running the abstracted CMAB.









\subsection{Transfer of expected values} \label{ssec:Scenario3}
 
We now consider the possibility of transferring the expected value of the rewards in the base CMAB in order to warm-start the abstracted CMAB. This approach corresponds to the computation represented by the blue arrows in Fig. \ref{fig:Qtransfer}. 
Formally, if the outer square of Fig. \ref{fig:Qtransfer} were to commute, then it would hold that: 
\begin{equation}
    \alphaext(\mu_{\dointv(\mathbf{x})}) = \mu'_{\abs(\dointv(\mathbf{x}))},
    %\alphaext(\expval{Y\vert \dointv(\mathbf{x})}{Y}) = \expval{Y'\vert \abs(\dointv(\mathbf{x}))}{Y'},    
\end{equation}
guaranteeing that the abstraction of the expected rewards is the same as the expected rewards in the abstracted model. 
In order to define a map $\alphaext$, we extend $\alphamap{Y'}:\domain[Y] \rightarrow \domain[Y']$ to $\mathbb{R} \rightarrow \mathbb{R}$ by selecting from a function class $\functionclass$ a map $\alphaext$ that interpolates the set of pairings $\mathcal{D} = \{(y,\alphamap{Y'}(y))\}$, for $y \in \domain[Y]$.
We then initialize each expected reward $\hat{\mu}'_{\dointv(\mathbf{x}'_i)}$ in the abstracted CMAB by optimistically transferring the highest expected reward among the actions mapping to $\mathbf{x}'_i$: 
\begin{equation}\label{eq:alphaexpval_transport}
  \hat{\mu}'_{\dointv(\mathbf{x}'_i)} = \max_{\mathbf{x}_j \in \abs^{-1}(\mathbf{x}'_i)} \alphaext(\hat{\mu}_{\dointv(\mathbf{x}_j)})
  %\estexpval{Y'|\dointv(\actioni_i)}{Y'} = \max_{\action_j \in \abs^{-1}(\actioni_i)} \alphaext(\estexpval{Y|\dointv(\action_j)}{Y})  
\end{equation}
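
As a minimal worked sketch of Eq. \ref{eq:alphaexpval_transport}, under assumed values chosen only for illustration, let $\domain[Y]=\{0,1\}$ and let $\alphamap{Y'}$ map $0 \mapsto 0$ and $1 \mapsto 2$, so that the linear map $\alphaext(y) = 2y$ interpolates both pairings exactly. If two base interventions in $\abs^{-1}(\mathbf{x}'_i)$ have estimated expected rewards $0.3$ and $0.45$, the transferred initial estimate is
\begin{equation*}
    \hat{\mu}'_{\dointv(\mathbf{x}'_i)} = \max\{\alphaext(0.3), \alphaext(0.45)\} = \max\{0.6, 0.9\} = 0.9.
\end{equation*}
Taking the maximum makes the initialization optimistic: the abstracted action inherits the estimate of the most promising base action that maps onto it.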

With reference to Fig. \ref{fig:Qtransfer}, we can now derive a bound on the difference between the upper path $\alphaext(\mu_{\dointv(\mathbf{x})})$ denoting our approach, and the lower path $\mu'_{\abs(\dointv(\mathbf{x}))}$ denoting a standard CMAB algorithm:


\begin{figure*}
    \centering
	\begin{subfigure}{.24\textwidth}
		\includegraphics[scale=0.26]{img/simul1.png}
        \caption{}
		\label{fig:simul1}
	\end{subfigure}
	\begin{subfigure}{.24\textwidth}
		\includegraphics[scale=0.26]{img/simul2.png}
		\caption{}
		\label{fig:simul2}
	\end{subfigure}
        \begin{subfigure}{.24\textwidth}
		\includegraphics[scale=0.27]{img/simul3.png}
		\caption{}
		\label{fig:simul3}
	\end{subfigure}
        \begin{subfigure}{.24\textwidth}
		\includegraphics[scale=0.27]{img/simul4.png}
		\caption{}
		\label{fig:simul4}
	\end{subfigure}
	%\vspace{-2\baselineskip}
	\caption{(a) Simple regret for \transferoptimum{} from Ex. \ref{ex:scenario1} for an exact and maximum-preserving abstraction (blue lines) and an exact but not maximum-preserving abstraction (green lines). (b) Regret difference for \imitation{} from Ex. \ref{ex:scenario2} using abstractions aggregating values differently (red and blue lines). (c) Cumulative regret for \transferexpect{} from Ex. \ref{ex:scenario3} for an abstraction preserving domains (blue lines) and an abstraction changing domains (green lines). See respective examples for further explanation. (d) Cumulative regret on the online advertising scenario.}\label{fig:results1}
	%\vspace{-1\baselineskip}
\end{figure*} 

\begin{proposition}[Bias of $\alphaext$]\label{prop:transferexp_biasedness}
    Assuming a linear interpolating function $\alphaext$, the difference $|\alphaext(\mu_{\dointv(\mathbf{x})}) - \mu'_{\abs(\dointv(\mathbf{x}))}|$ is upper bounded by $|\expval{Y\vert \dointv(\mathbf{x})}{\epsilon_{Y'}(Y)}| + \abserr$, where $\epsilon_{Y'}(Y)$ is the interpolation error introduced by $\alphaext$.
\end{proposition}

Thus, the quality of the transfer of the expected reward is a function of the interpolation $\alphaext$ and the IC error $\abserr$. Given an exact abstraction and a perfect linear interpolation, expected rewards in the base model can be exactly transported to the abstracted model. This bound allows us to define a confidence bound on the estimates set via $\alphaext$:
\begin{lemma}[Confidence bounds for $\alphaext$]\label{lem:transferexp_confidence}
    Assuming $\domain[Y]=\domain[Y']=[0,1]$ and assuming we used a linear interpolating function $\alphaext$ to compute $\hat{\mu}'_{\dointv(\mathbf{x}'_i)}$ as in Eq. \ref{eq:alphaexpval_transport}, with probability at least $1 - \delta$, it holds that 
    $| {\mu}'_{\dointv(\mathbf{x}'_i)} - \hat{\mu}'_{\dointv(\mathbf{x}'_i)} | \leq \kappa$,
    %$ |\expval{Y'|\dointv(\actioni_i)}{Y'} - \estexpval{Y'|\dointv(\actioni_i)}{Y'}| \leq \kappa$, 
    where $\kappa = \sqrt{\frac{2\log(2/\delta)}{\counter{\dointv(\mathbf{x}'_i)}}} + |\expval{Y\vert \dointv(\mathbf{x})}{\epsilon_{Y'}(Y)}| + \abserr$ and $\counter{\dointv(\mathbf{x}'_i)}$ counts the number of times action $\dointv(\mathbf{x}'_i)$ was taken.
\end{lemma}
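
As a rough numerical illustration of the magnitude of $\kappa$ (with hypothetical values chosen only for illustration, and taking $\log$ to be the natural logarithm), let $\delta = 0.05$, $\counter{\dointv(\mathbf{x}'_i)} = 100$, a perfect interpolation with zero interpolation error, and $\abserr = 0.05$. Then
\begin{equation*}
    \kappa = \sqrt{\frac{2\log(40)}{100}} + 0 + 0.05 \approx 0.272 + 0.05 \approx 0.32.
\end{equation*}
The statistical term shrinks as the count grows, whereas the contributions of the interpolation error and of $\abserr$ do not.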

We can then use this bound to restrict the actions in the abstracted CMAB to a reduced set $\optintervseti$ of optimistic interventions, that is, interventions $\dointv(\mathbf{x}'_i)$ such that $\exists \dointv(\mathbf{x}'_j) \in \intervseti$:
\begin{equation}
    \hat{\mu}'_{\dointv(\mathbf{x}'_i)} + \kappa \geq 
    \hat{\mu}'_{\dointv(\mathbf{x}'_j)} - \kappa.
\end{equation}
We refer to the algorithm that transfers the expected values via $\alphaext$, restricts the action set to $\optintervseti$, and then runs \ucb{} as \transferexpect{} (see Alg. \ref{alg:qtransfer}). We then immediately get:

\begin{proposition}[Cumulative regret of \transferexpect]\label{prop:transferexp_regret}
    Given a CAMAB, assume we use \transferexpect{} to initialize the expected rewards of the abstracted CMAB, and then we run \ucb{} for $T$ steps. Then the cumulative regret we incur is bounded by:
    $
    \cregri_\algte (T) \leq 3 \sum_{\actioni_i \in \optintervseti} \gap{\actioni_i} + \sum_{\actioni_i \in \optintervseti: \gap{\actioni_i}>0} \frac{16 \log(T)}{\gap{\actioni_i}}.
    $
\end{proposition}

Notice that \transferexpect{} relies on explicit knowledge of $\kappa$ 
%which depends on the IC error $\abserr$
which might not be available; in such a case, the results on biasedness will still hold, but we would not be able to restrict actions to $\optintervseti$. We leave the problem of learning abstractions and estimating the IC error to future work.

\begin{example}\label{ex:scenario3}
    Let us return one last time to the CAMAB of Ex. \ref{ex:counterexample1}, and let us modify the outcome $\texttt{Y}$ to be defined over $\domain[\texttt{Y}]=\{0,1,2\}$; thanks to the perfect interpolation and the low IC error, Fig. \ref{fig:simul3} shows that \transferexpect{} (blue solid) achieves a lower regret than running \ucb{} on the abstracted CMAB (blue dashed); however, if we were to change the domain of $\texttt{Y}$ to $\domain[\texttt{Y}]=\{0.4,0.5,10\}$, thus inducing a higher IC error and a higher interpolation error for $\alphaext$, then, as expected from Prop. \ref{prop:transferexp_biasedness}, \transferexpect{} (green solid) incurs a higher regret than \ucb{} (green dashed).
\end{example}

\transferexpect{} provides an algorithm to compute and transfer bounds within a CAMAB. As a further extension, in App. \ref{app:RewardTransfer} we also present a method based on the transfer of individual rewards, which combines the advantages of transferring individual quantities, as in \imitation{}, with the algorithmic approach of \transferexpect. 



\section{Experiments} \label{sec:Experiments}

In this section, we present further experimental results aimed at (i) discussing the relation between our work and methods from the transfer literature, and (ii) showcasing our algorithms on more realistic CMAB problems. Detailed results are provided in Appendix \ref{app:Experiments}.

Our CAMAB approach has a close resemblance to the \bucb{} algorithm for transfer learning proposed in \citet{zhang2017transfer}. \transferexpect{} can be seen as a generalization of \bucb{}, based on the idea of transporting bounds between related CMABs; indeed, while \bucb{} requires source and target SCMs to be defined on identical variables, our \transferexpect{} can relate SCMs defined on different variables. 
In the Appendix, we show an application of our methods to some of the specific scenarios investigated in \citet{zhang2017transfer} by recasting them in terms of abstractions.

Finally, we compare our algorithms on a more realistic CAMAB by reconsidering the online advertisement problem presented in the introduction. To define our base CMAB, we use the model presented in \citet{lu2020regret} based on data from Adobe (see Fig. \ref{fig:EmailCampaign}). For our abstracted CMAB, we design a smaller model that could have been drawn up by another department; specifically, we assume that the new model simplifies the purposes and products of the campaign and ignores the advertisement sending out time. Notice that in this scenario, $\bucb{}$ would not be applicable, as the two models in the CAMAB are defined over different variables.
Fig. \ref{fig:simul4} shows the cumulative regret of the proposed algorithms, highlighting how our CAMAB algorithms improve over directly running \ucb{} on the abstracted model; all the results are further explored in terms of the chosen abstraction in the Appendix. 


\section{Related work}

The stochastic MAB problem has been studied extensively within the MAB literature; see \cite{lattimore2020bandit} for a thorough overview. In this setting, it is typically assumed that the reward distributions associated with each action are statistically independent. Variations of popular index-based algorithms, including \ucb, 
become suboptimal once structural relationships between reward distributions are assumed \citep{russo2014, lattimore17a}. 

To address this, various structural MABs have been proposed which consider different kinds of statistical dependence between reward distributions, such as linear reward structures \citep{lattimore17a, rusmevichientong2010linearly, valkokernel, li2019nearly} and their generalization to nonstationary settings \citep{russacnonstat, zhao2020simple}. 
Similarly, no-regret learning algorithms have been designed for Lipschitz MABs \citep{magureanu14, kleinberg2019}, wherein similar actions have similar expected rewards. For finite-arm settings, \cite{lazaric2013sequential} and \cite{lattimorestructure} propose \ucb-style algorithms for structured MABs where the joint reward distribution belongs to a finite feasible set or to a parameterized family, respectively.
Inspired by \cite{combes2017}, \cite{van2023optimal} study a general setting wherein the reward distribution belongs to a known convex set.

The use of causal models to encode structural relationships in MABs was first proposed by \cite{bareinboim2015bandits} in the form of MABs with unobserved confounders (MABUC). The CMAB problem, in which actions are interventions on an SCM with known causal graph, was formally introduced by \cite{lattimore2016causal}, who designed an algorithm with sublinear simple regret guarantees. Since then, various \ucb-style algorithms for CMABs have been proposed \citep{pmlr-v130-nair21a, roydsep, lucucb}.

\cite{lazaric2012transfer,van2023optimal} have studied the problem of transferring information across MABs, but only recently has this problem been extended to CMABs.
%has been considered In the context of CMABs, % the main attention has been devoted to optimally solving individual MAB problems, although, only recently interest has been drawn to . 
Transfer within the same model, but across different observational and interventional regimes, has long been explained by standard do-calculus \citep{tian2002general}, and extended by \citet{forney2017counterfactual} to the counterfactual regime. \citet{zhang2017transfer}, instead, %considered cases where do-calculus does not allow for transfer and 
relied on bounds to transfer information between CMABs defined on identical variables; our work tackles a similar problem while relying on CA to overcome the limitation of requiring the SCMs to be defined over the same variables. Transfer of information between SCMs has also been modelled via generalized linear models \citep{feng2022combinatorial} and causal Bayesian optimization \citep{aglietti2020multi}.

Causal abstraction was first formalized by \citet{rubenstein2017causal} and \citet{beckers2018abstracting} as a relation between the interventional distributions of SCMs at different levels of abstraction. In this work, we follow the formulation in \citet{rischel2020category}, which explicitly defines a mapping between the variables of models related by an abstraction. Further related work is discussed in App. \ref{app:FurtherWork}.
   

\section{Conclusion}
In this paper we have considered how CMAB problems at different levels of resolution could be related via a formal abstraction map, and how such a map could be used to transfer information and improve learning. We have formulated the CAMAB problem, defined relevant measures, and proposed a simple taxonomy of CAMABs. We have then studied some representative scenarios, providing algorithms and theoretical analysis. Specifically: 
(i) In the first scenario (\transferoptimum) we showed that, for exact and non-exact abstractions alike, if we transfer the optimal action from the base CMAB we are not guaranteed that $\abs(\optaction) = \optactioni$ (Prop. \ref{prop:transferoptimum_biasedness}); we discussed a sufficient condition for the preservation of the optimum (Lem. \ref{lem:transferoptimum_suffcondition}) and showed that, without preservation of the optimum, the abstracted CMAB using \transferoptimum{} incurs a constant simple regret (Prop. \ref{prop:transferoptimum_simpleregret}).
(ii) In the second scenario (\imitation{}) we showed that, when running the abstracted CMAB using an exact or non-exact abstraction to transport actions, the confidence and the regret in the abstracted CMAB are in a trade-off (Prop. \ref{prop:imitation_confidence} and Prop. \ref{prop:imitation_regretlowerbound}), which highlights situations in which \imitation{} may be expected to perform better than \ucb{}; moreover, to reach an asymptotic sub-linear cumulative regret, \imitation{} still needs optimum preservation (Prop. \ref{prop:imitation_asymptoticregret}).
(iii) Finally, in the third scenario (\transferexpect{}), we showed that we can extend the abstraction as $\alphaext$ and directly transport expected rewards; we proved that the bias introduced by abstracting is bounded by the IC error and the interpolation error (Prop. \ref{prop:transferexp_biasedness}), and used this bound to define new confidence intervals (Lem. \ref{lem:transferexp_confidence}) and derive the cumulative regret of \transferexpect{} (Prop. \ref{prop:transferexp_regret}).

The theoretical and empirical results we have presented characterize the advantages of using an abstraction map, as well as its limitations. A naive use of an abstraction map may lead to sub-optimal results. In \transferoptimum, counterintuitively, even an exact abstraction was  no guarantee of a useful transfer.
% ; the standard consistency measure for abstraction was shown to be insufficient for CAMABs, and other criteria, such as preservation of the optimum, had to be assessed. 
Indeed, we have shown how transporting information between CMABs may depend on non-obvious details of the abstraction itself. In the case of \imitation, the confidence and the regret of the abstracted CAMAB were in a trade-off depending on how the abstraction aggregates actions and modifies their optimality gaps; in the case of \transferexpect{}, both the IC error and the interpolation error contribute to the quality of the final result. These non-trivial results highlight both the versatility of abstraction and the need for proper measures to evaluate its impact on a decision-making problem.

Summarising, \fmz{our work extends the scope of transfer learning across CMABs by laying}  a theoretical foundation (CAMAB) for transferring and exploiting information across decision-making problems observed at multiple levels of resolution via causal abstraction. 
Potential directions for future work include
the construction of efficient abstraction maps for CAMABs with well-aligned rewards, and the relationship between CAMABs and other
specialised frameworks for MABs, such as regional and structured bandits.

\begin{acknowledgements} % will be removed in pdf for initial submission,
						 % (without ‘accepted’ option in \documentclass)
                         % so you can already fill it to test with the
                         % ‘accepted’ class option
    NB, JD, AC, and MW acknowledge funding from a UKRI AI World Leading Researcher Fellowship awarded to Wooldridge (grant EP/W002949/1). MW and AC also acknowledge funding from Trustworthy AI - Integrating Learning, Optimisation and Reasoning (TAILOR), a project funded by European Union Horizon2020 research and innovation program under Grant Agreement 952215.
   YF: This scientific paper was supported by the Onassis Foundation - Scholarship ID: F ZR 063-1/2021-2022. TD acknowledges support from a UKRI Turing AI acceleration Fellowship [EP/V02678X/1].
\end{acknowledgements}
\bibliography{abstraction}


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% APPENDIX
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\newpage
\appendix
\onecolumn

\title{Causally Abstracted Multi-armed Bandits\\(Supplementary Material)}
\maketitle

\section{Additional Definitions} \label{app:AddDef}

In this section we provide additional definitions.

\begin{definition}[Jensen-Shannon Distance (JSD)] \label{ssec:JSD}
    Let $p$ and $q$ be two distributions strictly positive on the domain $\domain[X]$. The Jensen-Shannon distance is defined as: $$
    \jsd(p,q) = \sqrt{\frac{1}{2} d_{KL}\left(p,\frac{p+q}{2}\right) + \frac{1}{2} d_{KL}\left(q,\frac{p+q}{2}\right)}$$ where: $$d_{KL}(p,q) = \sum_{x\in\domain[X]} p(x) \log \frac{p(x)}{q(x)}$$ is the Kullback-Leibler (KL) divergence.
\end{definition}

\begin{definition}[Wasserstein P-distance] \label{ssec:W2D}
    Let $p(x)$ and $q(x')$ be two distributions on the domain $\domain[X]$ with finite $P$-th moments. The Wasserstein P-distance is defined as: $$
    \dwassP(p,q) = \inf_{\pi \in \Pi(p,q)} \left( \expval{\pi}{ d(x,x')^P } \right)^{\frac{1}{P}}$$  where $\pi$ is a joint distribution over $\domain[X] \times \domain[X]$ with marginals $p$ and $q$, taken from the set $\Pi(p,q)$ of all such joints, and $d(x,x')$ is a metric on $\domain[X]$.
\end{definition}
We take our $\dwassII(p,q)$ for $P=2$, $\domain[X]=\mathbb{R}$ or $\domain[X]=[0,1]$, and $d(x,x')$ as the Euclidean distance.
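
As a worked form of this choice (a standard fact about the real line, recalled here only as an illustration), the Wasserstein 2-distance between one-dimensional distributions admits a closed form through quantile functions. Writing $F_p^{-1}$ and $F_q^{-1}$ for the generalized quantile functions of $p$ and $q$, a notation introduced only for this remark, we have:
\begin{equation*}
    \dwassII(p,q) = \left( \int_0^1 \left( F_p^{-1}(u) - F_q^{-1}(u) \right)^2 \, \mathrm{d}u \right)^{\frac{1}{2}},
\end{equation*}
which corresponds to coupling the two distributions monotonically.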

\section{Assumptions} \label{app:Assumptions}

In this section we provide a complete listing of all our assumptions together with their explanation.

\subsection{SCM Assumptions}
Our SCM aligns with the standard assumptions proposed by \citet{rischel2020category}:
\begin{itemize}
	\item[] (SCM1) \emph{finite set of variables}: the SCM $\scm$ is defined on a finite set $\envars$ of endogenous variables.
	\item[] (SCM2) \emph{finite domains for the variables}: each endogenous variable $X \in \envars$ is defined on a finite domain $\domain[X]$.

        Notice that this assumption allows us to encode the structural functions $f_i$ of an SCM $\scm$ as stochastic matrices.
 
	\item[] (SCM3) \emph{acyclicity}: the graph $\dirag_{\scm}$ entailed by $\scm$ is acyclic.
\end{itemize}

\subsection{MAB Assumptions} 
Our definition of MAB subscribes to a set of common assumptions in the literature:
\begin{itemize}
	\item[] (MAB1) \emph{independent reward distributions}: each reward distribution $\rewarddistr{\action_i}$ is independent from other random variables.
	\item[] (MAB2) \emph{stationary reward distributions}: reward distributions $\rewarddistr{\action_i}$ do not change in time.
	\item[] (MAB3) \emph{no context}: no contextual information is provided to the agent before taking the action.
        \item[] (MAB4) \emph{finite variance of the distributions}: $\var{\rewarddistr{\action_i}}{\rewardvar}<\infty$, for all $\action_i$. A stronger assumption is:
        \begin{itemize}
            \item[] (MAB4+) \emph{Bernoulli reward distributions}: $\rewarddistr{\action_i} = \bern{p_i}$ for all $\action_i$, where $\bern{p}$ is a Bernoulli random variable with parameter $p$.
        \end{itemize}
\end{itemize}



\subsection{CMAB Assumptions}
Our CMAB inherits some of the MAB assumptions (CMAB2-CMAB4), as well as introducing new ones (CMAB5-CMAB7): 
\begin{itemize}
	\item[] (CMAB2) \emph{stationary reward process}: the distribution of the target variable $Y$ under each intervention does not change in time; this assumption is implied by the stationarity of the mechanisms of the SCM.
	\item[] (CMAB3) \emph{no context}: no observational data is provided before action.
	\item[] (CMAB4)  \emph{finite variance of the process}: $\var{Y \vert \dointv(\mathbf{x}_i)}{Y}<\infty$, for all $\mathbf{x}_i \in \intervset$.
	\item[] (CMAB5) \emph{limited set of intervenable variables}: only a subset of endogenous variables may be intervened upon \citep{lee2019structural,lu2020regret,aglietti2020causal}.
	\item[] (CMAB6) \emph{known DAG}: the DAG underlying the SCM is given. 
	%\item[] (CMAB7) \emph{observability of variables post-intervention}: the agent will collect observations from the whole SCM after performing an intervention \citep{lattimore2016causal}.
    \item[] (CMAB7) \emph{bounded domain of the rewards}: reward values $\rewardval$ lie in the interval $[0, 1]$ almost surely.
\end{itemize}

\subsection{Abstraction Assumptions}
We consider abstractions that comply with the following assumptions:
\begin{itemize}
	\item[] (AB1) \emph{partially specified abstraction:} we will assume that the abstraction $\abs$ is specified w.r.t. the relevant variables for the CMAB problem, but not necessarily completely specified. This assumption decomposes as follows:
    \begin{itemize}
        \item[] (AB1a) \emph{base interventions are on relevant variables:} for each action $\action_i$ corresponding to intervention $\dointv(\mathbf{X}_i = \mathbf{x}_i)$, it holds that $\mathbf{X}_i \subseteq \Rset$. Violation of this assumption would mean that certain actions in the base CMAB do not have any equivalent in the abstracted CMAB.

        \item[] (AB1b) \emph{base interventions have an image:} for each action $\action_i$ corresponding to intervention $\dointv(\mathbf{X}_i = \mathbf{x}_i)$, it holds that $\amap(\mathbf{X}_i) = \mathbf{X'}_j$, $\alphamap{\mathbf{X'}_j}(\mathbf{x}_i) = \mathbf{x}'_j$, and $\dointv(\mathbf{X'}_j = \mathbf{x}'_j) = \actioni_j \in \actionseti$. Violation of this assumption would mean that certain actions in the base CMAB cannot properly be mapped onto actions in the abstracted CMAB.

        \item[] (AB1c) \emph{abstracted interventions have a counterimage:} for each intervention $\actioni_j \in \actionseti$ there is a corresponding intervention $\action_i \in \actionset$ mapping onto it via abstraction $\abs$. This assumption implies that the simplified CMAB $\banditi$ does not introduce new actions not available in the base CMAB $\bandit$.
        
    \end{itemize}
 
    If the abstraction were not known, then this could lead us to consider a CAMAB problem where we want to learn at the same time an optimal policy and an abstraction map.
	
	\item[] (AB2) \emph{abstraction not necessarily exact:} we will not assume the abstraction to be exact.
	
	\item[] (AB3) \emph{agreement on target:} we will assume $\amap(Y) = Y'$, that is, the two models have unique target variables, which are mapped to each other; the resolution of the variable can change. This assumption implies that no factors are conflated in the target variable in the high-level model.

\end{itemize}



\section{Proofs} \label{app:Proofs}

In this section we provide proofs for the lemmata and propositions in the paper.

\subsection{Proof of Proposition \ref{prop:exprewardsbound}}
\textbf{Proposition \ref{prop:exprewardsbound}} (Bound on difference of expected rewards). 
\emph{Given a CAMAB, the difference in expected rewards $|\mu_{\action_i} - \mu'_{\abs(\action_i)}|$ is bounded by $\abserr + \rdiscr$.}

\emph{Proof.}
Let us consider a CAMAB defined on $\scm$ and $\scmi$. The IC error $\abserr$ evaluates the worst-case distance over possible interventions between the reward distribution in the pushforward of the base model and the abstracted model as:
$$
\abserr = \max_{\dointv(\mathbf{x}) \in \mathcal{I}} \dwassII ( \abs(\prob({Y}\vert \dointv(\mathbf{x}))), \prob({Y'}\vert \abs(\dointv(\mathbf{x})))).
$$
Because of the Wasserstein 2-distance, this immediately also provides a bound on the worst-case distance between the expected value in the pushforward of the base model and the abstracted model as:
$$
|\abs(\mu_{\dointv(\mathbf{x})}) - \mu'_{\abs(\dointv(\mathbf{x}))}| \leq \abserr.
$$
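A short sketch of why a Wasserstein 2-distance controls a difference of means, stated for generic distributions $p,q$ on $\mathbb{R}$ with means $\mu_p,\mu_q$ (symbols used here only for illustration): for any coupling $\pi \in \Pi(p,q)$, Jensen's inequality gives
\begin{equation*}
    |\mu_p - \mu_q| = \left| \expval{\pi}{x - x'} \right| \leq \expval{\pi}{|x - x'|} \leq \left( \expval{\pi}{|x - x'|^2} \right)^{\frac{1}{2}},
\end{equation*}
and taking the infimum over couplings yields $|\mu_p - \mu_q| \leq \dwassII(p,q)$.
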
Furthermore, the reward discrepancy quantifies the worst-case distance over possible interventions between the reward distribution in the base model and in the pushforward of the base model as:
\begin{equation*}
\rdiscr = \max_{\dointv(\mathbf{x}) \in \mathcal{I}} \dwassII (\prob({Y}\vert \dointv(\mathbf{x})), \abs( \prob({Y}\vert \dointv(\mathbf{x})))).
\end{equation*}
Again, because of the Wasserstein 2-distance, this immediately also provides a bound on the worst-case distance between the expected value in the base model and the pushforward of the base model:
$$
|\mu_{\dointv(\mathbf{x})} - \abs(\mu_{\dointv(\mathbf{x})})| \leq \rdiscr.
$$
By the triangle inequality, we get:
$$
|\mu_{\dointv(\mathbf{x})} - \mu'_{\abs(\dointv(\mathbf{x}))}| \leq \abserr + \rdiscr.
$$
That is, $|\mu_{\action_i} - \mu'_{\abs(\action_i)}|\leq \abserr + \rdiscr$. $\QED$

\subsection{Proof of Proposition \ref{prop:transferoptimum_biasedness}}
\textbf{Proposition \ref{prop:transferoptimum_biasedness}} (Biasedness of \transferoptimum). 
\emph{Assuming an exact abstraction $\abs$ and the optimality of the learned action $\action_o = \optaction$, it is not guaranteed that $\abs(\action_o) = \optactioni$.}

\emph{Proof.}
We prove this proposition through two counterexamples.

\emph{First counterexample.}
In the first counterexample, we consider the setup in Fig. \ref{fig:counterexample2_1} and \ref{fig:counterexample2_2} where intrinsic symmetries allow for multiple exact abstractions. Notice that all variables are binary.

Let us first consider the case in Fig. \ref{fig:counterexample2_1}. 

The base SCM $\scm$ is defined on binary variables and has structural functions given by the following matrices $f_{T} = \left[\begin{array}{c}
			.8\\
			.2
		\end{array}\right]$, $f_{M} = \left[\begin{array}{cc}
			.2 & .8\\
			.8 & .2
		\end{array}\right]$, $f_{Y} = \left[\begin{array}{cc}
			.7 & .3\\
			.3 & .7
		\end{array}\right]$. The abstracted model $\scmi$ is also defined on binary variables and has the following structural functions $f_{T'} = f_{T}$ and $f_{Y'} = f_{Y} f_{M}$ as a matrix product.  
    Let $\abs$ be defined on $\Rset, \amap$ as in Example \ref{ex:absCMAB}, and let $\alphamap{T'}, \alphamap{Y'}$ be identity matrices.
 Fig. \ref{fig:counterexample2_1} summarizes the setup. Assuming $\intervset = \{ \dointv(T=0), \dointv(T=1) \}$, the IC error for this abstraction $\abs$ is $\abserr=0$.
Moreover, the true expected rewards in the base CMAB are $\expval{Y\vert\dointv(T=0)}{Y}=0.62$ and $\expval{Y\vert\dointv(T=1)}{Y}=0.38$, leading to the optimal action $\optaction$ being $\dointv(T=0)$. Identically, in the abstracted CMAB we have $\expval{Y'\vert\dointv(T'=0)}{Y'}=0.62$ and $\expval{Y'\vert\dointv(T'=1)}{Y'}=0.38$, similarly leading to the optimal action $\optactioni$ being $\dointv(T'=0)$. Thus, if we use the abstraction $\abs$ to translate the optimal action $\optaction$ we would map it to $\abs(\optaction) = \dointv(T'=0)$. Therefore, $\optactioni = \abs(\optaction)$.
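
The expected rewards above can be checked by hand. As a sketch, we assume that each column of a structural matrix is the conditional distribution given the corresponding parent value (our reading of the matrices above) and that $\domain[Y]=\{0,1\}$:
\begin{align*}
    \prob(Y \vert \dointv(T=0)) & = \left[\begin{array}{cc}
			.7 & .3\\
			.3 & .7
		\end{array}\right] \left[\begin{array}{c}
			.2\\
			.8
		\end{array}\right] = \left[\begin{array}{c}
			.38\\
			.62
		\end{array}\right], \\
    \prob(Y \vert \dointv(T=1)) & = \left[\begin{array}{cc}
			.7 & .3\\
			.3 & .7
		\end{array}\right] \left[\begin{array}{c}
			.8\\
			.2
		\end{array}\right] = \left[\begin{array}{c}
			.62\\
			.38
		\end{array}\right],
\end{align*}
so that the expected rewards are $0 \cdot .38 + 1 \cdot .62 = 0.62$ and $0 \cdot .62 + 1 \cdot .38 = 0.38$. Since $f_{Y'} = f_{Y} f_{M}$, the same distributions are obtained in the abstracted model under $\dointv(T'=0)$ and $\dointv(T'=1)$.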

Let us now consider the alternative abstraction in Fig. \ref{fig:counterexample2_2}, where domains, mechanisms and abstraction are exactly as above, except for $\alphamap{T'}, \alphamap{Y'}$ which are now anti-diagonal matrices. See Fig. \ref{fig:counterexample2_2} for an illustration of this setup. With respect to the same intervention set $\intervset$, it still holds that $\abserr=0$. Furthermore all the expected rewards and the optimal actions $\optaction,\optactioni$ are similarly unchanged. However, the translation of $\optaction$ via $\abs$ is now reversed, leading to $\abs(\optaction) = \dointv(T'=1)$. Therefore, with this alternative, still exact abstraction, it holds that $\optactioni \neq \abs(\optaction)$.
This difference is due to the symmetries in the base and abstracted models that allow for the preservation of distributions although the domains are swapped.



\begin{figure}
\begin{subfigure}{.48\textwidth}
        \begin{center}
		\begin{tikzpicture}[shorten >=1pt, auto, node distance=1cm, thick, scale=0.8, every node/.style={scale=0.8}]
	\tikzstyle{node_style} = [circle,draw=black]
	\node[node_style] (S) at (0,0) {T};
	\node[node_style] (T) at (2.5,0) {M};
	\node[node_style] (C) at (5,0) {Y};
	
	\node[node_style] (Si) at (0,-3) {T'};
	\node[node_style] (Ci) at (5,-3) {Y'};
	% \draw[->]  (-2,0) to node[above,font=\small]{$\left[\begin{array}{c}
	% 		\cdot\\
	% 		\cdot
	% 	\end{array}\right]$} (S);
	\draw[->]  (S) to node[above,font=\small]{$\left[\begin{array}{cc}
			.2 & .8\\
			.8 & .2
		\end{array}\right] $} (T);
	\draw[->]  (T) to node[above,font=\small]{$\left[\begin{array}{cc}
			.7 & .3 \\
			.3 & .7 
		\end{array}\right]$} (C);
	
	% \draw[->]  (-2,-3) to node[below,font=\small]{$\left[\begin{array}{c}
	% 		\cdot\\
	% 		\cdot
	% 	\end{array}\right]$} (Si);
	\draw[->]  (Si) to node[below,font=\small]{$\left[\begin{array}{cc}
			.7 & .3 \\
			.3 & .7 
		\end{array}\right]\left[\begin{array}{cc}
			.2 & .8\\
			.8 & .2
		\end{array}\right]$} (Ci);
	
	\draw[->,dashed]  (S) to node[left,font=\small]{$\left[\begin{array}{cc}
			1 & 0\\
			0 & 1
		\end{array}\right]$} (Si);
	\draw[->,dashed]  (C) to node[right,font=\small]{$\left[\begin{array}{cc}
			1 & 0\\
			0 & 1
		\end{array}\right]$} (Ci);
\end{tikzpicture}
        \caption{First example.}
		\label{fig:counterexample2_1}
        \end{center}
	\end{subfigure}
	\begin{subfigure}{.48\textwidth}
        \begin{center}
		\begin{tikzpicture}[shorten >=1pt, auto, node distance=1cm, thick, scale=0.8, every node/.style={scale=0.8}]
	\tikzstyle{node_style} = [circle,draw=black]
	\node[node_style] (S) at (0,0) {T};
	\node[node_style] (T) at (2.5,0) {M};
	\node[node_style] (C) at (5,0) {Y};
	
	\node[node_style] (Si) at (0,-3) {T'};
	\node[node_style] (Ci) at (5,-3) {Y'};
	% \draw[->]  (-2,0) to node[above,font=\small]{$\left[\begin{array}{c}
	% 		\cdot\\
	% 		\cdot
	% 	\end{array}\right]$} (S);
	\draw[->]  (S) to node[above,font=\small]{$\left[\begin{array}{cc}
			.2 & .8\\
			.8 & .2
		\end{array}\right] $} (T);
	\draw[->]  (T) to node[above,font=\small]{$\left[\begin{array}{cc}
			.7 & .3 \\
			.3 & .7 
		\end{array}\right]$} (C);
	
	% \draw[->]  (-2,-3) to node[below,font=\small]{$\left[\begin{array}{c}
	% 		\cdot\\
	% 		\cdot
	% 	\end{array}\right]$} (Si);
	\draw[->]  (Si) to node[below,font=\small]{$\left[\begin{array}{cc}
			.7 & .3 \\
			.3 & .7 
		\end{array}\right]\left[\begin{array}{cc}
			.2 & .8\\
			.8 & .2
		\end{array}\right]$} (Ci);
	
	\draw[->,dashed]  (S) to node[left,font=\small]{$\left[\begin{array}{cc}
			0 & 1\\
			1 & 0
		\end{array}\right]$} (Si);
	\draw[->,dashed]  (C) to node[right,font=\small]{$\left[\begin{array}{cc}
			0 & 1\\
			1 & 0
		\end{array}\right]$} (Ci);
\end{tikzpicture}
		\caption{First counterexample.}
		\label{fig:counterexample2_2}
        \end{center}
	\end{subfigure}
\caption{Models for the first counterexample.}


\end{figure}



\emph{Second counterexample.}
In the second counterexample, let $\bandit$ and $\banditi$ be CMABs with the same underlying DAG in Fig. \ref{fig:CMAB-TMY}(right). The base model $\scm$ is defined on a binary variable $T$ and a ternary variable $Y$; we assume the following structural functions $f_{T} = \left[\begin{array}{c}
			.8\\
			.2
		\end{array}\right]$ and $f_{Y} = \left[\begin{array}{cc}
			.25 & .45\\
			.35 & .1 \\
                .4 & .45
		\end{array}\right]$. We also assign to the domain of $Y$ the following values $\domain[Y] = \{1,1.1,1.2\}$. In the abstracted model $\scmi$ we only use binary variables, and we define the structural functions as $f_{T'}=f_{T}$ and $f_{Y'} = \left[\begin{array}{cc}
			.6 & .55\\
			.4 & .45
		\end{array}\right]$. We assign to the domain of $Y'$ the standard values $\domain[Y'] = \{0,1\}$.
    Finally, let $\abs$ be defined on $\Rset, \amap$ as in Example \ref{ex:absCMAB}, and let $\alphamap{T'}$ be the identity while $\alphamap{Y'}=\left[\begin{array}{ccc}
			1 & 1 & 0\\
			0 & 0 & 1
		\end{array}\right]$. See Fig. \ref{fig:counterexample1} for an illustration.

The outcome node of the base CMAB and the outcome node of the abstracted CMAB are defined over different domains. Notice that the maps $\alphamap{T'},\alphamap{Y'}$ are order-preserving, preventing us from exploiting symmetries as in the previous counterexample.

Assuming $\intervset = \{ \dointv(T=0), \dointv(T=1) \}$, the IC error is $\abserr = 0$. The true expected rewards in the base CMAB are $\expval{Y\vert\dointv(T=0)}{Y}=1.115$ and $\expval{Y\vert\dointv(T=1)}{Y}=1.1$, leading to the optimal action $\optaction$ being $\dointv(T=0)$.

If we were to use the abstraction $\abs$ to translate the optimal action $\optaction$ we would map it to $\abs(\optaction) = \dointv(T'=0)$. However, if we were to compute the true expected rewards in the abstracted CMAB we would get $\expval{Y'\vert\dointv(T'=0)}{Y'}=0.4$ and $\expval{Y'\vert\dointv(T'=1)}{Y'}=0.45$, meaning that the optimal action $\optactioni$ is $\dointv(T'=1)$.
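
The arithmetic behind these values can be spelled out as follows. As a sketch, we assume that the columns of $f_{Y}$ and $f_{Y'}$ are the conditional distributions given $T=0,1$ and $T'=0,1$ respectively (our reading of the matrices above):
\begin{align*}
    \expval{Y\vert\dointv(T=0)}{Y} & = .25 \cdot 1 + .35 \cdot 1.1 + .4 \cdot 1.2 = 1.115, \\
    \expval{Y\vert\dointv(T=1)}{Y} & = .45 \cdot 1 + .1 \cdot 1.1 + .45 \cdot 1.2 = 1.1, \\
    \expval{Y'\vert\dointv(T'=0)}{Y'} & = .6 \cdot 0 + .4 \cdot 1 = 0.4, \\
    \expval{Y'\vert\dointv(T'=1)}{Y'} & = .55 \cdot 0 + .45 \cdot 1 = 0.45.
\end{align*}
The base model therefore ranks $\dointv(T=0)$ first, while the abstracted model ranks $\dointv(T'=1)$ first. Note also that $\alphamap{Y'} f_{Y} = f_{Y'}$ column by column, in agreement with $\abserr = 0$.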

Thus, $\optactioni \neq \abs(\optaction)$.
This difference is due to the different values in the domains of the outcome which, once accounted for in the expected rewards, lead to different results. $\QED$

\begin{figure}
\begin{center}
\begin{tikzpicture}[shorten >=1pt, auto, node distance=1cm, thick, scale=0.8, every node/.style={scale=0.6}]
	\tikzstyle{node_style} = [circle,draw=black, minimum size=2.3cm]
	\node[node_style] (S) at (0,0) {T: \{0,1\} };
	\node[node_style] (C) at (4,0) {Y: \{1,1.1,1.2\}};
	
	\node[node_style] (Si) at (0,-3) {T': \{0,1\}};
	\node[node_style] (Ci) at (4,-3) {Y': \{0,1\}};
	% \draw[->]  (-2,0) to node[above,font=\small]{$\left[\begin{array}{c}
	% 		\cdot\\
	% 		\cdot
	% 	\end{array}\right]$} (S);
	\draw[->]  (S) to node[above,font=\small]{$\left[\begin{array}{cc}
			.25 & .45\\
			.35 & .1\\
                .4 & .45\\
		\end{array}\right] $} (C);
	
	
	% \draw[->]  (-2,-3) to node[below,font=\small]{$\left[\begin{array}{c}
	% 		\cdot\\
	% 		\cdot
	% 	\end{array}\right]$} (Si);
	\draw[->]  (Si) to node[below,font=\small]{$\left[\begin{array}{cc}
			.6 & .55\\
			.4 & .45
		\end{array}\right]$} (Ci);
	
	\draw[->,dashed]  (S) to node[left,font=\small]{$\left[\begin{array}{cc}
			1 & 0\\
			0 & 1
		\end{array}\right]$} (Si);
	\draw[->,dashed]  (C) to node[right,font=\small]{$\left[\begin{array}{ccc}
			1 & 1 & 0\\
			0 & 0 & 1
		\end{array}\right]$} (Ci);
\end{tikzpicture}
\end{center}
    \caption{Model for the second counterexample.}
    \label{fig:counterexample1}
\end{figure}


\subsection{Proof of Lemma \ref{lem:transferoptimum_suffcondition}}

\textbf{Lemma \ref{lem:transferoptimum_suffcondition}} (Sufficient condition for preservation of maximum).
\emph{Given a CAMAB, the maximum is preserved by the abstraction $\abs$ if $\abserr+\rdiscr \leq \frac{1}{2} \min_{\action \in \actionset : \gap{\action}>0} \gap{\action}$.}


\emph{Proof.}
From Prop. \ref{prop:exprewardsbound} we know that:
$$
|\mu_{\dointv(\mathbf{x})} - \mu'_{\abs(\dointv(\mathbf{x}))}| \leq \abserr + \rdiscr.
$$
This implies that, under the abstraction, any mean $\mu_{\action}$ may change by as much as $\abserr + \rdiscr$.

Let $\mu^*$ be the mean of the optimal action and $\mu_{\action_o}$ the mean of the second-best action, that is $\action_o = \argmin_{\action \in \actionset : \gap{\action}>0} \gap{\action}$.
Then, for the maximum to be preserved, it suffices that the gap remains positive in the worst case, in which $\mu^*$ decreases and $\mu_{\action_o}$ increases by $\abserr + \rdiscr$:
\begin{align}
    (\mu^* -  (\abserr + \rdiscr)) -(\mu_{\action_o} + (\abserr + \rdiscr)) & > 0 \\
    (\mu^* - \mu_{\action_o}) - 2(\abserr + \rdiscr) & > 0\\
    \frac{1}{2}\gap{\action_o} & > \abserr + \rdiscr
\end{align}
Hence, $\abserr+\rdiscr \leq \frac{1}{2} \min_{\action \in \actionset : \gap{\action}>0} \gap{\action}$. $\QED$
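
As an illustration of how the condition can be checked, consider the second counterexample in the proof of Prop. \ref{prop:transferoptimum_biasedness}; this is a sketch that assumes the pushforward mean is computed on $\domain[Y']=\{0,1\}$. The only sub-optimal base action has gap $1.115 - 1.1 = 0.015$, so the condition would require $\abserr + \rdiscr \leq 0.0075$. However, the bound on means used in the proof of Prop. \ref{prop:exprewardsbound} gives:
\begin{equation*}
    \rdiscr \geq |\mu_{\dointv(T=0)} - \abs(\mu_{\dointv(T=0)})| = |1.115 - 0.4| = 0.715 > 0.0075,
\end{equation*}
so the sufficient condition fails, which is consistent with the optimum not being preserved in that example.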

\subsection{Lemma \ref{lem:transferoptimum_algsuffcondition}}

\begin{lemma}[Algebraic sufficient condition for preservation of maximum]\label{lem:transferoptimum_algsuffcondition}
    Given a CAMAB with a zero IC error abstraction, the maximum is preserved if there is no $\mathbf{b} \in \actionset$ such that $\mathbf{y'} \mathbf{A}_{Y'} (\mathbf{a^*}-\mathbf{b}) \mathbf{A}_{X'}^+ < 0$, where $A^+$ is the Moore-Penrose pseudo-inverse of $A$.
\end{lemma}

\emph{Proof.}
Given the optimal action $\optaction$, maximum-preservation means that, for any action $b \in \actionset$, it holds that $\mu^{*} > \mu_{b}$ and $\mu'_{\abs(a^*)} > \mu'_{\abs(b)}$.

Let us set up the matrix notation: $\mathbf{y} \in 1\times|\domain[Y]|, \mathbf{y'} \in 1\times|\domain[Y']|$ are the vectors of reward values; $\mathbf{\optaction},\mathbf{b} \in |\domain[Y]|\times 1, \mathbf{a'},\mathbf{b'} \in |\domain[Y']|\times 1$ are the interventional distributions of rewards for actions $\optaction,b$ and their corresponding abstracted actions $a',b'$; $\mathbf{A}_{X'} \in |\domain[X']| \times |\domain[X]|, \mathbf{A}_{Y'} \in |\domain[Y']| \times |\domain[Y]|$ encode the abstraction matrices $\alphamap{X'}, \alphamap{Y'}$ respectively.

We can now re-express the two conditions for optimum preservation above in matrix form:
\begin{align*}
    \begin{cases}
    \mathbf{y}\mathbf{\optaction} > \mathbf{y}\mathbf{b}\\
    \mathbf{y'}\mathbf{a'} > \mathbf{y'}\mathbf{b'}
\end{cases}    
\end{align*}

This means we need the condition:
\begin{align*}
    \mathbf{y'}\mathbf{a'} & > \mathbf{y'}\mathbf{b'} \\
    \mathbf{y'}\mathbf{a'} - \mathbf{y'}\mathbf{b'} & > 0 \\
    \mathbf{y'}(\mathbf{a'} - \mathbf{b'}) & > 0
\end{align*}
to hold for all $b' \in \actionseti$, or, equivalently, that there is no $b' \in \actionseti$ such that:
$$
\mathbf{y'}(\mathbf{a'} - \mathbf{b'}) \leq 0
$$

We can now redefine the abstracted interventional distributions $\mathbf{a'}$ and $\mathbf{b'}$ as functions of the base interventional distributions:
\begin{align*}
    \mathbf{a'} & =  \mathbf{A}_{Y'} \mathbf{a^*} \mathbf{A}_{X'}^+ \\
    \mathbf{b'} & =  \mathbf{A}_{Y'} \mathbf{b} \mathbf{A}_{X'}^+,
\end{align*}
where $A^+$ is the Moore-Penrose pseudo-inverse of $A$.

Substituting, we get the condition
\begin{align*}
    \mathbf{y'}(\mathbf{A}_{Y'} \mathbf{a^*} \mathbf{A}_{X'}^+ - \mathbf{A}_{Y'} \mathbf{b} \mathbf{A}_{X'}^+) & \leq 0 \\
    \mathbf{y'} \mathbf{A}_{Y'} (\mathbf{a^*}-\mathbf{b}) \mathbf{A}_{X'}^+ & \leq 0
\end{align*}
which we want to hold for no $b \in \actionset$. $\QED$





\subsection{Proof of Proposition \ref{prop:transferoptimum_simpleregret}}

\textbf{Proposition \ref{prop:transferoptimum_simpleregret}} (Asymptotic regret of \transferoptimum).
\emph{For $T\rightarrow\infty$, using \transferoptimum, $\sregri_\algto(T) \rightarrow 0$ iff $\abs(\optaction) = \optactioni$.}


\emph{Proof.} 
By running a CMAB algorithm on the base CMAB $\bandit$, we have that for $T \rightarrow \infty$, the algorithm will converge to the optimal action $\optaction$, therefore:
\begin{align*}
    \lim_{T \rightarrow \infty} \sregr_\alg(T) & = \lim_{T \rightarrow \infty} \expval{\alg}{\gap{\actiont{T}}} \\ & = \expval{\alg}{\gap{\optaction}} = 0
\end{align*}

At the same time $T$, in the abstracted CMAB we get
\begin{align*}
    \sregri_\algto(T) & = \expval{\algto}{\gap{\actionit{T}}} \\
    & = \gap{\abs(\optaction)}
\end{align*}
where we lose the expected value because \transferoptimum{} defines a deterministic policy over $\abs(\optaction)$.

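Now, if the maximum is preserved, that is, $\abs(\optaction) = \optactioni$:
\begin{align*}
    \sregri_\algto(T) &  = \gap{\abs(\optaction)} \\
    & = \gap{\optactioni} = 0
\end{align*}
proving that, as the simple regret of the base CMAB converges to zero, so does the simple regret of the abstracted CMAB. Conversely, if $\abs(\optaction) \neq \optactioni$ and the abstracted optimum is unique, then $\gap{\abs(\optaction)} > 0$ and the simple regret of the abstracted CMAB stays at this constant value instead of converging to zero. $\QED$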



\subsection{Proof of Lemma \ref{lem:imitation_unbiasedness}}

\textbf{Lemma \ref{lem:imitation_unbiasedness}} (Unbiasedness of \imitation).
\emph{Under the coverage assumption that $\policyt{t}$ has non-zero probability for every action $\actiont{t}$, the estimates $\hat{\mu}_{\dointv(\mathbf{x})}$ of \imitation{} are unbiased.}

\emph{Proof.} 
In \imitation{}, reward samples are unbiased since the abstracted agent takes action $\abs(\actiont{t})$ and receives the reward
$$
\rewardvalit{t} \distributes \rewarddistr{\abs(\actiont{t})}
$$
sampled from the actual abstracted environment of $\scmi$. 

Now, the requirement that $\policyt{t}$ has non-zero probability for every action $\actiont{t}$, together with the surjectivity in assumption (AB1c), implies that:
$$
    \forall \actioni_i \in \actionseti \quad \exists \action_i\in \actionset \quad\textrm{s.t. } \abs(\action_i)=\actioni_i,
$$
and, therefore:
$$
\policyit{t}(\actioni_i) = \sum_{\action_j \in \actionset : \abs(\action_j) = \actioni_i} \policyt{t}(\action_j) > 0
$$
for all actions $\actioni_i \in \actionseti$. This means that every action $\actioni \in \actionseti$ will be taken with some probability. Thus, in the long run, the expected values of the interventional distributions $\prob(Y' \vert \dointv(\mathbf{x}_i))$ can be estimated in an unbiased way. $\QED$

\subsection{Proof of Proposition \ref{prop:imitation_confidence}}

\textbf{Proposition \ref{prop:imitation_confidence}} (Confidence of \imitation). 
\emph{Given a CAMAB,
    assume we have run \ucb{} for $T$ steps on $\scm$. For \imitation{} to reach the same level of confidence in $\hat{\mu}_{\actioni_i}$ 
    %its estimates of the reward of action $\actioni_i$ 
    as running \ucb{} for $T$ steps on $\scmi$, it must hold $N(\mathcal{K}(\actioni_i) -1) + \left( \sum_{\action_{j} \in \abs^{-1}(\actioni_i)}\frac{1}{\gap{\action_j}^2} - \frac{1}{\gap{\actioni_i}^2} \right) \geq  0$ where $N>0$ is a constant term.}

\emph{Proof.} 
Assume we run the \ucb{} algorithm on the abstracted CMAB $\banditi$. The number of times \ucb{} will test an action in order to estimate its mean is proportional to:
\begin{equation}
    \expval{\algucb}{\counter{\actioni_i}} \propto N + \frac{1}{\gap{\actioni_i}^2},
\end{equation}
where the first term is a constant $N$ while the second term depends on the squared action gap.

Assume also that we are running \ucb{} on the base CMAB $\bandit$ as well; then, under \imitation{}, the number of times that an arm will be pulled will be:
\begin{equation*}
    \expval{\algimit}{\counter{\actioni_i}} \propto \sum_{\action_{j}\in\actionset : \abs(\action_{j})=\actioni_{i}} \left(N +  \frac{1}{\gap{\action_j}^2}\right)
\end{equation*}
To achieve an estimate of the expected return with the same level of confidence as running \ucb{} directly on $\banditi$, it is necessary for \imitation{} that the following relation holds:
\begin{equation*}
    \sum_{\action_{j}\in\actionset : \abs(\action_{j})=\actioni_{i}} \left(N +  \frac{1}{\gap{\action_j}^2}\right) \geq N + \frac{1}{\gap{\actioni_i}^2}
\end{equation*}
that is:
\begin{eqnarray*}
    \sum_{\action_{j}\in\actionset : \abs(\action_{j})=\actioni_{i}} \left(N +  \frac{1}{\gap{\action_j}^2}\right) - N - \frac{1}{\gap{\actioni_i}^2} & \geq &  0 \\
    N\left(\sum_{\action_{j}\in\actionset : \abs(\action_{j})=\actioni_{i}} 1 - 1 \right) + \sum_{\action_{j}\in\actionset : \abs(\action_{j})=\actioni_{i}} \frac{1}{\gap{\action_j}^2} - \frac{1}{\gap{\actioni_i}^2} & \geq &  0 \\
    N(\mathcal{K}(a'_{i}) -1) + \left( \sum_{\action_{j}\in\actionset : \abs(\action_{j})=\actioni_{i}}\frac{1}{\gap{\action_j}^2} - \frac{1}{\gap{\actioni_i}^2} \right) & \geq &  0
\end{eqnarray*}
where $\mathcal{K}(\actioni_{i})$ is the size of the cluster mapping to $\actioni_{i}$, that is, $\sum_{\action_{j}\in\actionset:\abs(\action_{j})=\actioni_{i}}1$. $\QED$
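
As a quick check of the condition in a special case (an illustration, not part of the original statement), suppose that $\mathcal{K}(\actioni_i) = 1$, so that a single base action $\action_j$ maps onto $\actioni_i$. The first term then vanishes and, for positive gaps, the condition reduces to:
\begin{equation*}
    \frac{1}{\gap{\action_j}^2} - \frac{1}{\gap{\actioni_i}^2} \geq 0 \quad \Longleftrightarrow \quad \gap{\action_j} \leq \gap{\actioni_i},
\end{equation*}
that is, the abstraction must not shrink the gap of the action.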


\subsection{Proof of Lemma \ref{lem:imitation_cumregret}}

\textbf{Lemma \ref{lem:imitation_cumregret}} (Cumulative regret for \imitation).
\emph{Given a CAMAB, assume we have run $\alg$ for $T$ steps on $\scm$.
     The difference in cumulative regret between running $\alg$ or \imitation{} on $\scmi$ is $\expval{\alg}  
{\sum_{t=0}^T 
\sum_{\actioni_i \in \actionseti} 
{\abs(
\prob(\actionit{t}=\actioni_i)) 
\mu_{\actioni_i}}} - \expval{\alg} {\sum_{t=0}^T \sum_{\actioni_i \in \actionseti} \prob(\actionit{t}=\actioni_i) \mu_{\actioni_i}}$.}

\emph{Proof.} By definition the cumulative regret when learning in the abstracted CMAB using algorithm $\alg$ is:
$$
\cregri(T) = \expval{\alg} {\sum_{j=t}^T \sum_{\actioni_i \in \actionseti} \prob(\actionit{j}=\actioni_i) \gap{\actioni_i}}.
$$
This quantity may be decomposed as:
\begin{align*}
    \cregri(T) & = \expval{\alg} {\sum_{j=t}^T \sum_{\actioni_i \in \actionseti} \prob(\actionit{j}=\actioni_i) (\expval{\rewarddistr{\optactioni}}{\rewardvar} - \expval{\rewarddistr{\actioni_i}}{\rewardvar})}\\
    & = \left(T-t+1\right) \expval{\rewarddistr{\optactioni}}{\rewardvar} - \expval{\alg} {\sum_{j=t}^T \sum_{\actioni_i \in \actionseti} \prob(\actionit{j}=\actioni_i) \expval{\rewarddistr{\actioni_i}}{\rewardvar}}
\end{align*}

For convenience, and with no loss of generality, let us take $t=0$ and define the maximal expected reward that can be accumulated in the abstracted CMAB over the timesteps $0,\dots,T$ as $M'_T=\left(T+1\right) \expval{\rewarddistr{\optactioni}}{\rewardvar}$. We then have:
$$
\cregri(T) = M'_T - \expval{\alg}{\sum_{j=0}^T \sum_{\actioni_i \in \actionseti} \prob(\actionit{j}=\actioni_i) \expval{\rewarddistr{\actioni_i}}{\rewardvar}}.
$$

In the case of learning via the \imitation{} algorithm $\algimit$, the cumulative regret is:
$$
\cregrimit(T) = \expval{\algimit} {\sum_{j=t}^T \sum_{\actioni_i \in \actionseti} \prob(\actionit{j}=\actioni_i) \gap{\actioni_i}}.
$$
Through an analogous decomposition, we can get:
$$
\cregrimit(T) = M'_T - \expval{\algimit} {\sum_{j=0}^T \sum_{\actioni_i \in \actionseti} \prob(\actionit{j}=\actioni_i) \expval{\rewarddistr{\actioni_i}}{\rewardvar}}.
$$

At each timestep of \imitation, the probability $\prob(\actionit{j}=\actioni_i)$ of selecting the action $\actioni_i$ is the same as the probability of algorithm $\alg$ on the base CMAB of selecting an action $\action_i$ such that $\abs(\action_i) = \actioni_i$. Thus, the cumulative regret can be re-expressed as:
$$
\cregrimit(T) = M'_T - \expval{\alg} {\sum_{j=0}^T \sum_{\action_i \in \actionset} \prob(\actiont{j}=\action_i) \expval{\rewarddistr{\abs(\action_i)}}{\rewardvar}}.
$$

Grouping all actions $\action_k$ such that $\abs(\action_k) = \actioni_i$ we get:
$$
\cregrimit(T) = M'_T - \expval{\alg}  
{\sum_{j=0}^T 
\sum_{\actioni_i \in \actionseti} 
\sum_{\action_k \vert \abs(\action_k) = \actioni_i} 
\prob(\actiont{j}=\action_k) \expval{\rewarddistr{\actioni_i}}{\rewardvar}}.
$$

This, in turn, is just the definition of the pushforward of the distribution $\prob(\actiont{j}=\action_i)$ via $\abs$:
$$
\cregrimit(T) = M'_T - \expval{\alg}  
{\sum_{j=0}^T 
\sum_{\actioni_i \in \actionseti} 
{\alphamap{X'_i}}_\#(
\prob)(\actionit{j}=\actioni_i) \expval{\rewarddistr{\actioni_i}}{\rewardvar}}.
$$
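
As a minimal illustration of this grouping (a toy instance chosen here for exposition, not one of the experimental scenarios), assume three base actions $\action_1, \action_2, \action_3$ with $\abs(\action_1)=\actioni_1$ and $\abs(\action_2)=\abs(\action_3)=\actioni_2$, and assume that at timestep $j$ the algorithm $\alg$ selects them with probabilities $0.5$, $0.3$ and $0.2$. The pushforward then assigns to each abstracted action the total mass of its preimage:
\begin{align*}
{\alphamap{X'_i}}_\#(\prob)(\actionit{j}=\actioni_1) &= \prob(\actiont{j}=\action_1) = 0.5, \\
{\alphamap{X'_i}}_\#(\prob)(\actionit{j}=\actioni_2) &= \prob(\actiont{j}=\action_2) + \prob(\actiont{j}=\action_3) = 0.3 + 0.2 = 0.5.
\end{align*}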
% which we shorthand as:
% $$
% \cregrimit = M'_T - \expval{\alg}  
% {\sum_{j=0}^T 
% \sum_{\actioni_i \in \actionseti} 
% \abs(
% \prob(\actiont{j}=\action_i)) \expval{\rewarddistr{\actioni_i}}{\rewardvar}}.
% $$

Now taking the difference $\cregri - \cregrimit$ we obtain:
\begin{align*}
    \cregri(T) - \cregrimit(T) & = 
    M'_T - \expval{\alg}{\sum_{j=0}^T \sum_{\actioni_i \in \actionseti} \prob(\actionit{j}=\actioni_i) \expval{\rewarddistr{\actioni_i}}{\rewardvar}} - 
    M'_T + \expval{\alg}  
{\sum_{j=0}^T 
\sum_{\actioni_i \in \actionseti} 
{\alphamap{X'_i}}_\#(
\prob)(\actionit{j}=\actioni_i) \expval{\rewarddistr{\actioni_i}}{\rewardvar}}\\
 & = \expval{\alg}  
{\sum_{j=0}^T 
\sum_{\actioni_i \in \actionseti} 
{\alphamap{X'_i}}_\#(
\prob)(\actionit{j}=\actioni_i) \expval{\rewarddistr{\actioni_i}}{\rewardvar}} - \expval{\alg} {\sum_{j=0}^T \sum_{\actioni_i \in \actionseti} \prob(\actionit{j}=\actioni_i) \expval{\rewarddistr{\actioni_i}}{\rewardvar}},
\end{align*}
which highlights the role of the abstraction in the difference between the regrets. $\QED$


\subsection{Proof of Proposition \ref{prop:imitation_regretlowerbound}}

\textbf{Proposition \ref{prop:imitation_regretlowerbound}} (Regret lower bound of \imitation).
\emph{Given a CAMAB, assume we have run \ucb{} for $T$ steps on $\scm$. For \imitation{} to have a lower regret bound than running \ucb{} for $T$ steps on $\scmi$, it must hold $3\sum_{a_{i}'\in\mathcal{A}'}\Delta(a_{i}')\left[1-\mathcal{K}(a'_{i})\right]+16\log T\sum_{a_{i}'\in\mathcal{A}'}\left[\frac{1}{\Delta(a_{i}')}-\Delta(a_{i}')\frac{1}{\sum_{\abs^{-1}(\actioni_i)}\Delta^{2}(a_{j})}\right] \geq 0$.}

\emph{Proof.} Assume we run \ucb{} on the abstracted CMAB $\banditi$. The upper bound on the regret over $T$ timesteps is:

\begin{equation*}
    \cregri_\algucb(T) \leq \sum_{\actioni_{i}\in\actionseti}\Delta(\actioni_{i})\left[3+\frac{16\log T}{\Delta^{2}(\actioni_{i})}\right].
\end{equation*}

Assume also that we are running \ucb{} on the base CMAB $\bandit$; then, under \imitation, the regret on the abstracted CMAB will be upper bounded by:

\begin{equation*}
\cregri_\algimit(T) \leq\sum_{\actioni_{i}\in\intervseti}\Delta(\actioni_{i})\left[\sum_{\action_{j}\in\actionset:\alpha(\action_{j})=\actioni_{i}}3+\frac{16\log T}{\Delta^{2}(\action_{j})}\right].
\end{equation*}

For \imitation{} to achieve a regret bound lower than that of \ucb{} we need:

\begin{align*}
 \cregri_\algucb(T)-\cregri_\algimit(T) & \geq & 0\\
  \sum_{a_{i}'\in\mathcal{A}'}\Delta(a_{i}')\left[3+\frac{16\log T}{\Delta^{2}(a_{i}')}\right]-\sum_{a_{i}'\in\mathcal{A}'}\Delta(a_{i}')\left[\sum_{a_{j}\in\mathcal{A}:\alpha(a_{j})=a'_{i}}3+\frac{16\log T}{\Delta^{2}(a_{j})}\right] & \geq & 0\\
 3\sum_{a_{i}'\in\mathcal{A}'}\Delta(a_{i}')+\sum_{a_{i}'\in\mathcal{A}'}\Delta(a_{i}')\frac{16\log T}{\Delta^{2}(a_{i}')}-\sum_{a_{i}'\in\mathcal{A}'}\Delta(a_{i}')\sum_{a_{j}\in\mathcal{A}:\alpha(a_{j})=a'_{i}}3-\sum_{a_{i}'\in\mathcal{A}'}\Delta(a_{i}')\sum_{a_{j}\in\mathcal{A}:\alpha(a_{j})=a'_{i}}\frac{16\log T}{\Delta^{2}(a_{j})} & \geq & 0
\end{align*}

Let us define a quantity $\mathcal{K}(a'_{i})$ that gives us the
size of the cluster of base actions mapping to $a_{i}'$, that is, $\mathcal{K}(a'_{i}) = \sum_{a_{j}\in\mathcal{A}:\alpha(a_{j})=a'_{i}}1$.
Then:

\begin{align*}
3\sum_{a_{i}'\in\mathcal{A}'}\Delta(a_{i}')+\sum_{a_{i}'\in\mathcal{A}'}\Delta(a_{i}')\frac{16\log T}{\Delta^{2}(a_{i}')}-\sum_{a_{i}'\in\mathcal{A}'}\Delta(a_{i}')\mathcal{K}(a'_{i})3-\sum_{a_{i}'\in\mathcal{A}'}\Delta(a_{i}')\sum_{a_{j}\in\mathcal{A}:\alpha(a_{j})=a'_{i}}\frac{16\log T}{\Delta^{2}(a_{j})} & \geq & 0\\
3\left[\sum_{a_{i}'\in\mathcal{A}'}\Delta(a_{i}')-\sum_{a_{i}'\in\mathcal{A}'}\Delta(a_{i}')\mathcal{K}(a'_{i})\right]+16\log T\left[\sum_{a_{i}'\in\mathcal{A}'}\Delta(a_{i}')\frac{1}{\Delta^{2}(a_{i}')}-\sum_{a_{i}'\in\mathcal{A}'}\Delta(a_{i}')\sum_{a_{j}\in\mathcal{A}:\alpha(a_{j})=a'_{i}}\frac{1}{\Delta^{2}(a_{j})}\right] & \geq & 0\\
3\sum_{a_{i}'\in\mathcal{A}'}\Delta(a_{i}')\left[1-\mathcal{K}(a'_{i})\right]+16\log T\left[\sum_{a_{i}'\in\mathcal{A}'}\frac{1}{\Delta(a_{i}')}-\sum_{a_{i}'\in\mathcal{A}'}\Delta(a_{i}')\frac{1}{\sum_{a_{j}\in\mathcal{A}:\alpha(a_{j})=a'_{i}}\Delta^{2}(a_{j})}\right] & \geq & 0\\
3\sum_{a_{i}'\in\mathcal{A}'}\Delta(a_{i}')\left[1-\mathcal{K}(a'_{i})\right]+16\log T\sum_{a_{i}'\in\mathcal{A}'}\left[\frac{1}{\Delta(a_{i}')}-\Delta(a_{i}')\frac{1}{\sum_{a_{j}\in\mathcal{A}:\alpha(a_{j})=a'_{i}}\Delta^{2}(a_{j})}\right] & \geq & 0 \; \QED
\end{align*}

%Notice:
%\begin{itemize}
%\item The first term depends only on the difference $\left[1-\mathcal{K}(a'_{i})\right]$: this is a penalty due to the fact that multiple base actions that are samples a constant number of time are aggregated into a single abstract action that will be oversampled. This term is zero only in case of a one-to-one mapping. 
%\item The second term term depends on how the ptimality gaps are changed $\left[\frac{1}{\Delta(a_{i}')}-\Delta(a_{i}')\frac{1}{\sum_{a_{j}\in\mathcal{A}:\alpha(a_{j})=a'_{i}}\Delta^{2}(a_{j})}\right]$. For each $a_{i}'$ this term is positive only if $\sum_{a_{j}\in\mathcal{A}:\alpha(a_{j})=a'_{i}}\Delta(a_{j})\geq\Delta(a_{i}')$, that is the gaps in the abstract are smaller than in the base. This makes sense as we get an advantage in using \imitation{} only if the abstract problem is more challenging than the base one.
%\end{itemize} 

\subsection{Proof of Proposition \ref{prop:imitation_asymptoticregret}}

\textbf{Proposition \ref{prop:imitation_asymptoticregret}} (Asymptotic regret for \imitation)
\emph{For $T\rightarrow\infty$, the abstracted CMAB using \imitation{} achieves sub-linear cumulative regret iff $\abs(\optaction) = \optactioni$.}

\emph{Proof.} By running on the base CMAB $\bandit$ a CMAB algorithm such as \ucb, we have that, for $T \rightarrow \infty$, the algorithm will converge to the optimal action $\optaction$, such that:
\begin{align*}
    \prob(\actiont{T} = \optaction) \rightarrow 1
\end{align*}

Consequently, in the abstracted CMAB we get
\begin{align*}
    \prob(\actionit{T} = \abs(\optaction)) \rightarrow 1.
\end{align*}

Now, because of the assumed preservation of the maximum, $\abs(\optaction) = \optactioni$, so the abstracted agent also converges to the optimal action $\optactioni$:
\begin{align*}
    \prob(\actionit{T} = \optactioni) \rightarrow 1,
\end{align*}
thus proving that the cumulative regret will grow sub-linearly. $\QED$
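
The converse direction can be sketched along the same lines, under the additional assumption (made here for illustration) that $\abs(\optaction)$ is strictly sub-optimal, that is, $\gap{\abs(\optaction)} > 0$. Since $\prob(\actionit{T} = \abs(\optaction)) \rightarrow 1$, the expected per-step regret converges to $\gap{\abs(\optaction)}$, so that
$$
\lim_{T \rightarrow \infty} \frac{\cregrimit(T)}{T} = \gap{\abs(\optaction)} > 0,
$$
and the cumulative regret grows linearly in $T$. For instance, with abstracted expected rewards $0.55$ for $\optactioni$ and $0.45$ for $\abs(\optaction)$ (hypothetical values), the regret would eventually increase by about $0.10$ per timestep.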













\subsection{Proof of Proposition \ref{prop:transferexp_biasedness}}

\textbf{Proposition \ref{prop:transferexp_biasedness}} (Bias of $\alphaext$)
\emph{Assuming a linear interpolating function $\alphaext$, the difference $|\alphaext(\mu_{\dointv(\mathbf{x})}) - \mu'_{\abs(\dointv(\mathbf{x}))}|$ is upper bounded by $|\expval{Y\vert \dointv(\mathbf{x})}{\epsilon_{Y'}(Y)}| + \abserr$, where $\epsilon_{Y'}(Y)$ is the interpolation error introduced by $\alphaext$.}

\emph{Proof.}
We can derive the bound on the difference between the upper path $\alphaext(\expval{Y\vert \dointv(\mathbf{x})}{Y})$ and the lower path $\expval{Y'\vert \alphamap{\mathbf{X'}}(\dointv(\mathbf{x}))}{Y'}$ as in Fig. \ref{fig:Qtransfer} by bounding the following transformations:
\begin{align}
    \alphaext(\expval{Y\vert \dointv(\mathbf{x})}{Y}) & \rightarrow \label{eq:p3step1} \\
    \expval{Y\vert \dointv(\mathbf{x})}{\alphaext(Y)} & \rightarrow \label{eq:p3step2}\\
    \expval{Y\vert \dointv(\mathbf{x})}{\alphamap{Y'}(Y)} & \rightarrow \label{eq:p3step3}\\
    \expval{Y'\vert \alphamap{{X'}}(\dointv(\mathbf{x}))}{Y'}. & \label{eq:p3step4}
\end{align}

To evaluate these transformations we will rely on the following assumptions:
\begin{itemize}
    \item[(AS1)] \emph{linear interpolation function:} we will assume that $\alphaext$ is learned from a family of linear functions $\functionclass$;
    \item[(AS2)] \emph{identity of domains:} we will expect the outcome domains in the base and abstracted model to be the same, $\domain[Y]=\domain[Y']$; notice that, if that is not the case, we can always redefine the domains of the outcomes to be the union $\domain[Y]\cup\domain[Y']$ of the domains. This assumption guarantees that expected values may be equally computed on the domain of $Y$ or $Y'$.
\end{itemize}

(i) Let us consider the passage from Eq. \ref{eq:p3step1} to Eq. \ref{eq:p3step2} and let us bound the difference $|\alphaext(\expval{Y\vert \dointv(\mathbf{x})}{Y}) - \expval{Y\vert \dointv(\mathbf{x})}{\alphaext(Y)}|$.
We know that, in general, for a function $f$ with second-derivative bounded by $M$, we have:
$$
|\expval{p}{f(X)} - f(\expval{p}{X})| \leq M \var{p}{X}, 
$$
where $p$ is the distribution of $X$.
Assuming our extension $\alphaext$ has bounded second derivative, we get that:
$$
|\expval{Y\vert \dointv(\mathbf{x})}{\alphamap{\expvalop{Y'}}(Y)} - \alphamap{\expvalop{Y'}}(\expval{Y\vert \dointv(\mathbf{x})}{Y})| \leq M \var{Y}{Y}
$$
More specifically, under the assumption (AS1) of a linear extension, because of the zero second derivative, we get that:
$$
|\expval{Y\vert \dointv(\mathbf{x})}{\alphamap{\expvalop{Y'}}(Y)} - \alphamap{\expvalop{Y'}}(\expval{Y\vert \dointv(\mathbf{x})}{Y})| \leq 0.
$$
Hence, $\alphaext(\expval{Y\vert \dointv(\mathbf{x})}{Y}) = \expval{Y\vert \dointv(\mathbf{x})}{\alphaext(Y)}$. If assumption (AS1) did not hold, the passage from Eq. \ref{eq:p3step1} to Eq. \ref{eq:p3step2} could add a contribution greater than zero to the bound.
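
To make the linear case explicit, write the extension as $\alphaext(y) = w y + b$, where the slope $w$ and intercept $b$ are symbols introduced here only for illustration. By linearity of the expectation:
\begin{align*}
\expval{Y\vert \dointv(\mathbf{x})}{\alphaext(Y)} & = \expval{Y\vert \dointv(\mathbf{x})}{w Y + b} \\
& = w \, \expval{Y\vert \dointv(\mathbf{x})}{Y} + b \\
& = \alphaext(\expval{Y\vert \dointv(\mathbf{x})}{Y}),
\end{align*}
so the two quantities coincide whatever the variance of $Y$.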



(ii) Let us consider the passage from Eq. \ref{eq:p3step2} to Eq. \ref{eq:p3step3} and let us bound the difference $|\expval{Y\vert \dointv(\mathbf{x})}{\alphaext(Y)} - \expval{Y\vert \dointv(\mathbf{x})}{\alphamap{Y'}(Y)}|$. Notice that in the expression $\expval{Y\vert \dointv(\mathbf{x})}{\alphamap{Y'}(Y)}$ the expected value is taken on the domain of $Y'$ w.r.t. an interventional distribution over $Y$; this expression makes sense thanks to assumption (AS2).

Now, we know that:
$$
    \alphamap{{Y'}}(Y) = \alphaext(Y) + \epsilon_{Y'}(Y), 
$$
where $\epsilon_{Y'}(Y)$ is the interpolation error due to the approximation of $\alphamap{{Y'}}$ as $\alphaext$. We then have:
\begin{align*}
    |\expval{Y\vert \dointv(\mathbf{x})}{\alphaext(Y)} - \expval{Y\vert \dointv(\mathbf{x})}{\alphamap{Y'}(Y)}| & = \\
    |\expval{Y\vert \dointv(\mathbf{x})}{\alphamap{\expvalop{Y'}}(Y)} - \expval{Y\vert \dointv(\mathbf{x})}{\alphamap{\expvalop{Y'}}(Y) + \epsilon_{Y'}(Y)}| & = \\
    |\expval{Y\vert \dointv(\mathbf{x})}{\alphamap{\expvalop{Y'}}(Y)} - \expval{Y\vert \dointv(\mathbf{x})}{\alphamap{\expvalop{Y'}}(Y)} - \expval{Y\vert \dointv(\mathbf{x})}{\epsilon_{Y'}(Y)}| & = \\
    |\expval{Y\vert \dointv(\mathbf{x})}{\epsilon_{Y'}(Y)}|.
\end{align*}
Thus, the bound depends on the quality of the interpolation $\alphaext$ under the expectation $\expvalop{}$. In other words, the bound depends on the interpolation error $\epsilon_{Y'}(Y)$ weighted by the distribution over $Y$. An extension $\alphaext$ that perfectly interpolates all the points of $\alphamap{{Y'}}$ would reduce this bound to zero.



(iii) Let us consider the passage from Eq. \ref{eq:p3step3} to Eq. \ref{eq:p3step4} and let us bound the difference $|\expval{Y\vert \dointv(\mathbf{x})}{\alphamap{Y'}(Y)} - \expval{Y'\vert \alphamap{{X'}}(\dointv(\mathbf{x}))}{Y'}|$.

First, because of the definition of pushforward, it holds that, in general, $\expval{f_\#p}{X} = \expval{p}{f(X)}$, therefore:
$$
\expval{Y\vert \dointv(\mathbf{x})}{\alphamap{Y'}(Y)} = \expval{{\alphamap{Y'}}_\#(Y\vert \dointv(\mathbf{x}))}{Y},
$$
which, using our shorthand, can be written as:
$$
\expval{Y\vert \dointv(\mathbf{x})}{\alphamap{Y'}(Y)} = \expval{\abs(Y\vert \dointv(\mathbf{x}))}{Y}.
$$

We then want to evaluate the new difference:
$$
|\expval{\abs(Y\vert \dointv(\mathbf{x}))}{Y} - \expval{Y'\vert \alphamap{{X'}}(\dointv(\mathbf{x}))}{Y'}|.
$$
Because of the identity between the outcome domains of assumption (AS2), we can re-express the above difference as:
$$
|\expval{\abs(Y\vert \dointv(\mathbf{x}))}{Y} - \expval{Y'\vert \alphamap{{X'}}(\dointv(\mathbf{x}))}{Y}|.
$$
Since the Wasserstein distance provides us with a bound on the distance between the expected values of the two distributions, we have that:
%This is the same difference we already bounded in Proposition \ref{prop:qtransfer_bias}, meaning that:
$$
|\expval{\abs(Y\vert \dointv(\mathbf{x}))}{Y} - \expval{Y'\vert \alphamap{{X'}}(\dointv(\mathbf{x}))}{Y}| \leq \dwassII( \abs(\prob(Y\vert \dointv(\mathbf{x}))), \prob(Y'\vert \alphamap{{X'}}(\dointv(\mathbf{x}))) ),
$$
which, again, is bounded by the worst-case distance computed by the IC error over the intervention set:
$$
|\expval{\abs(Y\vert \dointv(\mathbf{x}))}{Y} - \expval{Y'\vert \alphamap{{X'}}(\dointv(\mathbf{x}))}{Y}| \leq \abserr.
$$

Thus, overall, we can bound the error of transferring the expected reward as:
$$
|\alphaext(\expval{Y\vert \dointv(\mathbf{x})}{Y})-\expval{Y'\vert \alphamap{\mathbf{X'}}(\dointv(\mathbf{x}))}{Y'}| \leq |\expval{Y\vert \dointv(\mathbf{x})}{\epsilon_{Y'}(Y)}| + \abserr,
$$
as a function of the approximation error and the IC error. $\QED$

\subsection{Proof of Lemma \ref{lem:transferexp_confidence}}

\textbf{Lemma \ref{lem:transferexp_confidence}} (Confidence bound for $\alphaext$)
\emph{Assuming $\domain[Y]=\domain[Y']=[0,1]$ and assuming we used a linear interpolating function $\alphaext$ to compute $\hat{\mu}'_{\dointv(\mathbf{x}'_i)}$ as in Eq. \ref{eq:alphaexpval_transport}, with probability at least $1 - \delta$, it holds that 
    $| {\mu}'_{\dointv(\mathbf{x}'_i)} - \hat{\mu}'_{\dointv(\mathbf{x}'_i)} | \leq \kappa$,
    %$ |\expval{Y'|\dointv(\actioni_i)}{Y'} - \estexpval{Y'|\dointv(\actioni_i)}{Y'}| \leq \kappa$, 
    where $\kappa = \sqrt{\frac{2\log(2/\delta)}{\counter{\dointv(\actioni_i)}}} + |\expval{Y\vert \dointv(\mathbf{x})}{\epsilon_{Y'}(Y)}| + \abserr$ and $\counter{\dointv(\mathbf{x}'_i)}$ counts the number of times action $\dointv(\mathbf{x}'_i)$ was taken.}

\emph{Proof.} Combining Hoeffding's inequality for a deviation $\eta > 0$:
$$
\prob(|\expval{Y'|\dointv(\actioni_i)}{Y'} - \estexpval{Y'|\dointv(\actioni_i)}{Y'}| \geq \eta) \leq 2\exp\left(-2 \counter{\dointv(\actioni_i)} \eta^2\right)
$$
with the bias computed in Prop. \ref{prop:transferexp_biasedness}, we can immediately derive a confidence bound given by the sum of the upper bound provided by Hoeffding's inequality \citep{lattimore2016causal} and the bias due to the IC and interpolation error:
$$
|\expval{Y'|\dointv(\actioni_i)}{Y'} - \estexpval{Y'|\dointv(\actioni_i)}{Y'}| \leq \sqrt{\frac{2\log(2/\delta)}{\counter{\dointv(\actioni_i)}}} + |\expval{Y\vert \dointv(\mathbf{x})}{\epsilon_{Y'}(Y)}| + \abserr,
$$
that is,
$$
|\expval{Y'|\dointv(\actioni_i)}{Y'} - \estexpval{Y'|\dointv(\actioni_i)}{Y'}| \leq \kappa. \QED
$$
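
For reference, the first term of $\kappa$ can be recovered as follows; this is a sketch assuming rewards in $[0,1]$ and writing $n = \counter{\dointv(\actioni_i)}$ for the number of samples and $\eta$ for the deviation (both symbols are introduced here for brevity). The two-sided Hoeffding inequality gives a failure probability of at most $2\exp(-2 n \eta^2)$; setting this equal to $\delta$ and solving for $\eta$ yields:
$$
\eta = \sqrt{\frac{\log(2/\delta)}{2 n}} \leq \sqrt{\frac{2\log(2/\delta)}{n}},
$$
so the radius used in $\kappa$ is a valid, slightly conservative choice.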

\subsection{Proof of Proposition \ref{prop:transferexp_regret}}

\textbf{Proposition \ref{prop:transferexp_regret}} (Cumulative regret of \transferexpect)
\emph{Given a CAMAB, assume we use \transferexpect{} to initialize the expected rewards of the abstracted CMAB, and then we run \ucb{} for $T$ steps. Then the cumulative regret we incur is bounded by:
    $$
    \cregri_\algte \leq 3 \sum_{\actioni_i \in \optintervseti} \gap{\actioni_i} + \sum_{\actioni_i \in \optintervseti: \gap{\actioni_i}>0} \frac{16 \log(T)}{\gap{\actioni_i}}.
    $$}

\emph{Proof.} Since \transferexpect{} runs \ucb{} after transferring the expected values, its cumulative regret satisfies the \ucb{} regret bound \citep{lattimore2020bandit} restricted to the new action set $\optintervseti$. Therefore:
$$
    \cregri_\algte \leq 3 \sum_{\actioni_i \in \optintervseti} \gap{\actioni_i} + \sum_{\actioni_i \in \optintervseti: \gap{\actioni_i}>0} \frac{16 \log(T)}{\gap{\actioni_i}}. \QED
$$











\section{CAMAB Algorithms}\label{app:Algorithms}

In this section we summarize the algorithms we have considered, provide their pseudo-code, and compare them from a computational point of view.


\paragraph{Direct training.} As a baseline for our CAMABs, we considered the possibility of training the abstracted agent directly on the CMAB $\banditi$ using a standard MAB algorithm $\alg$ (see Alg. \ref{alg:direct}). In our experiments we adopted the standard \ucb{} algorithm \citep{lattimore2020bandit}.
Direct training requires an agent to select actions, collect rewards, and update its estimates.

\paragraph{\transferoptimum.} A naive approach to solve a CAMAB would be to solve the base CMAB $\bandit$, learn the optimal action $\action_o$, and then simply set the policy of the abstracted CMAB $\banditi$ to select deterministically the abstracted action $\abs(\action_o)$.
The \transferoptimum{} algorithm does not require any decision-making, reward-collection or update; it just requires the translation of a single action.

\paragraph{\imitation.} Given the set of trajectories $\trajectoryset=\{(\actiont{t},\rewardvalt{t})\}_{t=1}^T$ collected during the training of the base agent, the abstracted agent can be trained by imposing that it follows the corresponding action trajectories under $\abs$. At each timestep $t$, instead of choosing action $\actionit{t}$, the agent translates action $\actiont{t}$ using the map $\abs$ into the image action $\abs(\actiont{t})$ (see Alg. \ref{alg:imitation}).
\imitation{} removes decision-making from the abstracted agent, but it still requires running the model $\scmi$ to compute rewards, as well as updating estimates after every action.


\paragraph{\transferexpect.} 
Instead of relying on the trajectories, the abstracted agent could be directly initialized by transferring the expected rewards learned by the base agent, discarding sub-optimal actions, and then following a direct training approach (see Alg. \ref{alg:qtransfer}). In the \transferexpect{} algorithm, the abstracted agent does not have to translate all the trajectories in $\mathcal{D}$, but simply transfers the derived statistics.



\begin{algorithm}[tb]
	\caption{Direct Learning}\label{alg:direct}
	\begin{algorithmic}[1]
		\STATE { \textbf{Input:} } CMAB $\bandit$, time horizon $T$ 
		\STATE { \textbf{Output:} } optimal policy $\policy$
		
		\STATE Initialize expected rewards $\estexpvalt{Y\vert \action_i}{Y}{0}$, auxiliary statistics $\suppstatst{0}$, and policy $\policyt{0}$ \COMMENT{Set up the parameters}
		\FOR{$t = 1, \dots, T$ }{
			
			\STATE Select $\actiont{t} \distributes \policyt{t-1}$ \COMMENT{Decision-making}
			
			\STATE Receive $\rewardvalt{t} \distributes \rewarddistr{\actiont{t}}$ \COMMENT{Reward-collection}
			
			\STATE Compute $\estexpvalt{Y\vert \actiont{t}}{Y}{t},\suppstatst{t} \leftarrow \operatorname{update}\left(\estexpvalt{Y\vert \actiont{t}}{Y}{t-1}, \suppstatst{t-1},\actiont{t},\rewardvalt{t}\right)$ \COMMENT{Update stats}
		
                \STATE Compute $\policyt{t} \leftarrow \alg\left(\estexpvalt{Y\vert \actiont{t}}{Y}{t}, \suppstatst{t}\right)$ \COMMENT{Update policy}
		} 
  
		\ENDFOR
		
		\STATE { \textbf{Return:} } $\policyt{T}$
		
	\end{algorithmic}
\end{algorithm}



\paragraph{Comparison of the algorithms}
The computational steps of the algorithms presented above are summarized in Tab. \ref{tab:comp_steps}. A couple of observations are in order.
%First, direct training simply represents a baseline, as it does not make any use of the abstraction $\abs$.
The \transferoptimum{} algorithm is computationally extremely cheap but, as discussed in the main text, it may not return an optimal policy. The \imitation{} algorithm may be computationally and economically more efficient than direct training, as it replaces the step of decision-making with a simple function application. In addition, when coupled with a learning algorithm $\alg$ where updating statistics $\mathcal{S}^{(t)}$ is not timestep dependent, the \imitation{} algorithm can process all the trajectories in $\mathcal{D}$ at once in parallel.
\transferexpect{} transfers information more efficiently by translating a single statistic instead of re-running every action; the algorithm follows the steps of a standard MAB algorithm, with the advantage of an effective initialization which can allow it to converge to the optimal action faster.
%Second, \imitation{} and replay require the abstraction of individual quantities, that is, individual actions $a^{(t)}$ and reward $r^{(t)}$, which can be transferred by simple application of $\abs$; on the other hand, transferring $\Q^{(t)}$ requires the abstraction of a whole probability distribution, which can happen either unidirectionally once (as in Q-transfer) or bidirectionally continuously (as in Q-exchange).





\begin{figure}
	\begin{centering}
		\begin{tabular}{c>{\centering}p{1.2cm}>{\centering}p{1.2cm}>{\centering}p{1.2cm}>{\centering}p{1.2cm}>{\centering}p{1.2cm}>{\centering}p{1.2cm}}
			\hline 
			& Decision-making & Reward-collection & Update & Action-translation & Reward-translation & Exp value-translation \tabularnewline
			\hline 
			\emph{Direct} & $\checkmark$ & $\checkmark$ & $\checkmark$ &  & & \tabularnewline
                \hline 
			\transferoptimum{} &  & & & $\checkmark$ & & \tabularnewline
			\hline 
			\imitation{} &  & $\checkmark$ & $\checkmark$ & $\checkmark$ & & \tabularnewline
			% \hline 
			% \emph{Replay} &  &  & $\checkmark$ & $\checkmark$ & $\checkmark$ &\tabularnewline
                \hline 
			\transferexpect{} & $\checkmark$ & $\checkmark$ & $\checkmark$ &  &  & $\checkmark$\tabularnewline
			\hline 
   %              \emph{Q-exchange} & $\checkmark$ & $\checkmark$ & $\checkmark$ &  &  & $\checkmark$\tabularnewline
			% \hline 
		\end{tabular}
		\par\end{centering}
	\caption{Computational steps}\label{tab:comp_steps}
	
\end{figure}

\subsection{Transfer of rewards} \label{app:RewardTransfer}
An intermediate approach between \imitation{} and \transferexpect{} is offered by an algorithm transferring rewards. Assuming rewards have been collected and stored, they could be individually transferred through the map $\alphamap{Y'}$, as illustrated by the green arrows in Fig. \ref{fig:imitation}. This algorithm would transfer individual quantities like \imitation{} while focusing directly on rewards like \transferexpect{}.

If given a zero IC error abstraction, reward samples from $\prob(Y\vert\dointv(\mathbf{x}_i))$ can be exactly transformed via $\abs$ into samples of $\abs(\prob(Y\vert\dointv(\mathbf{x}_i))) = \prob(Y'\vert\abs(\dointv(\mathbf{x}_i)))$, allowing for an unbiased estimation of the expected values.

Compared to \transferexpect, transporting rewards enjoys a lower bias since it relies on the true abstraction map $\alphamap{Y'}$ instead of an extension $\alphaext$ introducing an interpolation error. Confidence bounds for this algorithm, derived as in Lemma \ref{lem:transferexp_confidence}, could then reduce to $\kappa = \sqrt{\frac{2\log(2/\delta)}{\counter{\dointv(\actioni_i)}}} + \abserr$, and be used to construct a reduced intervention set $\optintervseti$.

\section{Experimental Details} \label{app:Experiments}

\subsection{Transfer of optimal action}\label{app:Scenario1}

\emph{Scenario 1.} In the first experiment we consider the CAMAB defined in Ex. \ref{ex:counterexample1} where we take $\alphamap{T'}$ and $\alphamap{Y'}$ to be identity matrices, as illustrated in Fig. \ref{fig:counterexample2_1}. This constitutes a CAMAB with an abstraction which is exact ($\abserr=0$) and maximum preserving. We run \ucb{} on the base CMAB for a variable number of steps $n_{steps} \in \{10, 25, 50, 100, 250, 500\}$. After $n_{steps}$ steps we transfer the learned optimal action $\action_o$ in the base CMAB to the abstracted CMAB as $\abs(\action_o)$ using \transferoptimum. We then compute the simple regret at each $n_{steps}$. We repeat the procedure $20$ times and report the mean of the simple regret. The resulting simple regrets are plotted in Fig. \ref{fig:simul1} in blue, with the solid line representing \transferoptimum, and the dashed line representing \ucb.

\emph{Scenario 2.} In the second experiment we consider the CAMAB defined in Ex. \ref{ex:counterexample1}  where we take $\alphamap{T'}$ and $\alphamap{Y'}$ to be anti-diagonal matrices, as illustrated in Fig. \ref{fig:counterexample2_2}. This constitutes a CAMAB with an abstraction which is exact $\abserr=0$ and non-maximum preserving. We run the same protocol as on \emph{Scenario 1}. The resulting simple regrets are plotted in Fig. \ref{fig:simul1} in green, with the solid line representing \transferoptimum, and the dashed line representing \ucb.

\subsection{Transfer of actions}\label{app:Scenario2}
\emph{Scenario 3.} In the third experiment we consider the CAMAB defined in Ex. \ref{ex:counterexample1} except for the mechanism $f_{Y'}$ in the abstracted CMAB that takes the form $f_{Y'} = \left[\begin{array}{cc}
			.7 & .3\\
			.3 & .7
		\end{array}\right]$. This new model is illustrated in Fig. \ref{fig:scenario3}. The change in the abstracted CMAB induces an IC error $\abserr \approx 0.229$. We run \ucb{} on the base CMAB, and \ucb{} and \imitation{} on the abstracted CMAB, for $500$ steps, while tracking the cumulative regret. We repeat the procedure $20$ times and report means of the cumulative regret as solid lines. Fig. \ref{fig:scenario2_1a} shows that the estimated expected rewards when training $\banditi$ directly with \ucb{} (blue) or using \imitation{} (orange) are unbiased w.r.t. the true values (black), as stated by Lemma \ref{lem:imitation_unbiasedness}. However, Fig. \ref{fig:scenario2_1b} confirms that the cumulative regret for \imitation{} is significantly higher than that of \ucb{} due to the agent taking suboptimal actions from $\abs(\policy)$, as implied by Prop. \ref{prop:imitation_asymptoticregret}.

\begin{figure*}
    \centering
	\begin{subfigure}{.49\textwidth}
		\centering
        \includegraphics[scale=0.45]{img/scenario2_1a.png}
        \caption{}
		\label{fig:scenario2_1a}
	\end{subfigure}
	\begin{subfigure}{.49\textwidth}
		\centering
        \includegraphics[scale=0.45]{img/scenario2_1b.png}
		\caption{}
		\label{fig:scenario2_1b}
	\end{subfigure}
	%\vspace{-2\baselineskip}
	\caption{Expected reward and regret using \imitation. Shaded areas in (b) represent the standard deviation of the regret.}\label{fig:results2B}
	%\vspace{-1\baselineskip}
\end{figure*}

\emph{Scenario 4.} In the fourth experiment we consider the same CAMAB used in \emph{Scenario 1} and illustrated in Fig. \ref{fig:counterexample2_1}. We run the same protocol as on \emph{Scenario 3}. Fig. \ref{fig:scenario2_2a} shows the cumulative regret of \ucb{} and \imitation{} on the abstracted CMAB, while Fig. \ref{fig:scenario2_2b} shows the difference between \ucb{} and \imitation{} in case of zero IC error (\emph{Scenario 4}) and non-zero IC error (\emph{Scenario 3}) in accordance with Lemma \ref{lem:imitation_cumregret}.

\emph{Scenario 5.} In the fifth scenario we instantiate a CAMAB as in Example \ref{ex:absCMAB}. We define the base mechanisms as $f_{T} = \left[\begin{array}{c}
			.7 \\
			.2 \\
                .1
		\end{array}\right]$, $f_{M} = \left[\begin{array}{ccc}
			.2 & .8 & .7 \\
			.8 & .2 & .3 \\
		\end{array}\right]$,  $f_{Y} = \left[\begin{array}{cc}
			.7 & .3 \\
			.3 & .7 \\
		\end{array}\right]$, and the abstracted mechanisms as $f_{T'} = \left[\begin{array}{c}
			.8 \\
			.2
		\end{array}\right]$, $f_{Y'} = \left[\begin{array}{cc}
			.55 & .45 \\
			.45 & .55 \\
		\end{array}\right]$.

We consider two possible abstractions: $\abs_1$ is defined by $\Rset=\{T,Y\}$, $\amap(T)=T', \amap(Y)=Y'$, and $\alphamap{T'} = \left[\begin{array}{ccc}
			1 & 0 & 0 \\
			0 & 1 & 1 \\
		\end{array}\right]$ and $\alphamap{Y'}$ as an identity; $\abs_2$ is defined in the same way except for $\alphamap{T'} = \left[\begin{array}{ccc}
			0 & 1 & 0 \\
			1 & 0 & 1 \\
		\end{array}\right]$.
The two corresponding CAMABs are illustrated in Fig. \ref{fig:scenario5a} and \ref{fig:scenario5b}.
In the base CMAB we consider the following set of actions $\intervset = \{ \emptyset, \dointv(T=0), \dointv(T=1), \dointv(T=2) \}$, which is mapped by the abstraction to the set $\intervseti = \{ \emptyset, \dointv(T'=0), \dointv(T'=1) \}$. The true expected rewards in the base CMAB are $\expval{Y\vert \emptyset}{Y} = 0.56, \expval{Y\vert \dointv(T=0)}{Y} = 0.62, \expval{Y\vert \dointv(T=1)}{Y} = 0.38, \expval{Y\vert \dointv(T=2)}{Y} = 0.42$, while in the abstracted CMAB we have $\expval{Y'\vert \emptyset}{Y'} = 0.47, \expval{Y'\vert \dointv(T'=0)}{Y'} = 0.45, \expval{Y'\vert \dointv(T'=1)}{Y'} = 0.55$. Thus, in the base CMAB $\optaction=\dointv(T=0)$ and in the abstracted CMAB $\optactioni=\dointv(T'=1)$.
We run the same protocol as on \emph{Scenario 3}. Fig. \ref{fig:simul2} shows the difference between the cumulative regret of \ucb{} and \imitation{} on the abstracted CMAB in the two CAMABs differing only in the abstraction, $\abs_1$ and $\abs_2$. Notice that the result is consistent with our theoretical results, as $\abs_1$, which does not preserve the optimal action, achieves much higher cumulative regret than $\abs_2$, which maps $\optaction$ and another high-reward action onto $\optactioni$.
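
The interventional expected rewards listed above can be checked by hand. The following sketch assumes that each matrix column is the conditional distribution of the child given one value of its parent, and that the reward is the probability of the outcome taking value $1$ (second row); both conventions are assumptions made here. For the base CMAB, marginalizing over $M$:
\begin{align*}
\expval{Y\vert \dointv(T=0)}{Y} & = 0.2 \cdot 0.3 + 0.8 \cdot 0.7 = 0.62, \\
\expval{Y\vert \dointv(T=1)}{Y} & = 0.8 \cdot 0.3 + 0.2 \cdot 0.7 = 0.38, \\
\expval{Y\vert \dointv(T=2)}{Y} & = 0.7 \cdot 0.3 + 0.3 \cdot 0.7 = 0.42,
\end{align*}
while in the abstracted CMAB the values $0.45$ and $0.55$ are read directly from the second row of $f_{Y'}$.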


\begin{figure}
\begin{subfigure}{.48\textwidth}
        \begin{center}
		\begin{tikzpicture}[shorten >=1pt, auto, node distance=1cm, thick, scale=0.8, every node/.style={scale=0.8}]
	\tikzstyle{node_style} = [circle,draw=black]
	\node[node_style] (S) at (0,0) {T};
	\node[node_style] (T) at (2.5,0) {M};
	\node[node_style] (C) at (5,0) {Y};
	
	\node[node_style] (Si) at (0,-3) {T'};
	\node[node_style] (Ci) at (5,-3) {Y'};
	% \draw[->]  (-2,0) to node[above,font=\small]{$\left[\begin{array}{c}
	% 		\cdot\\
	% 		\cdot
	% 	\end{array}\right]$} (S);
	\draw[->]  (S) to node[above,font=\small]{$\left[\begin{array}{cc}
			.2 & .8\\
			.8 & .2
		\end{array}\right] $} (T);
	\draw[->]  (T) to node[above,font=\small]{$\left[\begin{array}{cc}
			.7 & .3 \\
			.3 & .7 
		\end{array}\right]$} (C);
	
	% \draw[->]  (-2,-3) to node[below,font=\small]{$\left[\begin{array}{c}
	% 		\cdot\\
	% 		\cdot
	% 	\end{array}\right]$} (Si);
	\draw[->]  (Si) to node[below,font=\small]{$\left[\begin{array}{cc}
			.7 & .3 \\
			.3 & .7 
		\end{array}\right]$} (Ci);
	
	\draw[->,dashed]  (S) to node[left,font=\small]{$\left[\begin{array}{cc}
			1 & 0\\
			0 & 1
		\end{array}\right]$} (Si);
	\draw[->,dashed]  (C) to node[right,font=\small]{$\left[\begin{array}{cc}
			1 & 0\\
			0 & 1
		\end{array}\right]$} (Ci);
    \end{tikzpicture}
        \caption{Scenario 3.}
		\label{fig:scenario3}
        \end{center}
	\end{subfigure}
	\begin{subfigure}{.48\textwidth}
        \begin{center}
		\begin{tikzpicture}[shorten >=1pt, auto, node distance=1cm, thick, scale=0.8, every node/.style={scale=0.8}]
	\tikzstyle{node_style} = [circle,draw=black]
	\node[node_style] (S) at (0,0) {T};
	\node[node_style] (T) at (2.5,0) {M};
	\node[node_style] (C) at (5,0) {Y};
	
	\node[node_style] (Si) at (0,-3) {T'};
	\node[node_style] (Ci) at (5,-3) {Y'};
	% \draw[->]  (-2,0) to node[above,font=\small]{$\left[\begin{array}{c}
	% 		\cdot\\
	% 		\cdot
	% 	\end{array}\right]$} (S);
	\draw[->]  (S) to node[above,font=\small]{$\left[\begin{array}{ccc}
			.2 & .8 & .7\\
			.8 & .2 & .3
		\end{array}\right] $} (T);
	\draw[->]  (T) to node[above,font=\small]{$\left[\begin{array}{cc}
			.7 & .3 \\
			.3 & .7 
		\end{array}\right]$} (C);
	
	% \draw[->]  (-2,-3) to node[below,font=\small]{$\left[\begin{array}{c}
	% 		\cdot\\
	% 		\cdot
	% 	\end{array}\right]$} (Si);
	\draw[->]  (Si) to node[below,font=\small]{$\left[\begin{array}{cc}
			.55 & .45\\
			.45 & .55
		\end{array}\right]$} (Ci);
	
	\draw[->,dashed]  (S) to node[left,font=\small]{$\left[\begin{array}{ccc}
			1 & 0 & 0\\
			0 & 1 & 1
		\end{array}\right]$} (Si);
	\draw[->,dashed]  (C) to node[right,font=\small]{$\left[\begin{array}{cc}
			1 & 0 \\
			0 & 1 
		\end{array}\right]$} (Ci);
\end{tikzpicture}
		\caption{Scenario 5a.}
		\label{fig:scenario5a}
        \end{center}
	\end{subfigure}
\caption{Models for experimental simulations.}


\end{figure}



\begin{figure}
\begin{subfigure}{.48\textwidth}
        \begin{center}
		\begin{tikzpicture}[shorten >=1pt, auto, node distance=1cm, thick, scale=0.8, every node/.style={scale=0.8}]
	\tikzstyle{node_style} = [circle,draw=black]
	\node[node_style] (S) at (0,0) {T};
	\node[node_style] (T) at (2.5,0) {M};
	\node[node_style] (C) at (5,0) {Y};
	
	\node[node_style] (Si) at (0,-3) {T'};
	\node[node_style] (Ci) at (5,-3) {Y'};
	% \draw[->]  (-2,0) to node[above,font=\small]{$\left[\begin{array}{c}
	% 		\cdot\\
	% 		\cdot
	% 	\end{array}\right]$} (S);
	\draw[->]  (S) to node[above,font=\small]{$\left[\begin{array}{ccc}
			.2 & .8 & .7\\
			.8 & .2 & .3
		\end{array}\right] $} (T);
	\draw[->]  (T) to node[above,font=\small]{$\left[\begin{array}{cc}
			.7 & .3 \\
			.3 & .7 
		\end{array}\right]$} (C);
	
	% \draw[->]  (-2,-3) to node[below,font=\small]{$\left[\begin{array}{c}
	% 		\cdot\\
	% 		\cdot
	% 	\end{array}\right]$} (Si);
	\draw[->]  (Si) to node[below,font=\small]{$\left[\begin{array}{cc}
			.55 & .45\\
			.45 & .55
		\end{array}\right]$} (Ci);
	
	\draw[->,dashed]  (S) to node[left,font=\small]{$\left[\begin{array}{ccc}
			1 & 0 & 0\\
			0 & 1 & 1
		\end{array}\right]$} (Si);
	\draw[->,dashed]  (C) to node[right,font=\small]{$\left[\begin{array}{cc}
			1 & 0 \\
			0 & 1 
		\end{array}\right]$} (Ci);
\end{tikzpicture}
        \caption{Scenario 5b.}
		\label{fig:scenario5b}
        \end{center}
	\end{subfigure}
	\begin{subfigure}{.48\textwidth}
        \begin{center}
		\begin{tikzpicture}[shorten >=1pt, auto, node distance=1cm, thick, scale=0.8, every node/.style={scale=0.8}]
	\tikzstyle{node_style} = [circle,draw=black]
	\node[node_style] (S) at (0,0) {T};
	\node[node_style] (T) at (2.5,0) {M};
	\node[node_style] (C) at (5,0) {Y};
	
	\node[node_style] (Si) at (0,-3) {T'};
	\node[node_style] (Ci) at (5,-3) {Y'};
	% \draw[->]  (-2,0) to node[above,font=\small]{$\left[\begin{array}{c}
	% 		\cdot\\
	% 		\cdot
	% 	\end{array}\right]$} (S);
	\draw[->]  (S) to node[above,font=\small]{$\left[\begin{array}{cc}
			.2 & .8\\
			.8 & .2
		\end{array}\right] $} (T);
	\draw[->]  (T) to node[above,font=\small]{$\left[\begin{array}{cc}
			.6 & .3 \\
			.3 & .4 \\
                .1 & .3
		\end{array}\right]$} (C);
	
	% \draw[->]  (-2,-3) to node[below,font=\small]{$\left[\begin{array}{c}
	% 		\cdot\\
	% 		\cdot
	% 	\end{array}\right]$} (Si);
	\draw[->]  (Si) to node[below,font=\small]{$\left[\begin{array}{cc}
			.6 & .3 \\
			.3 & .4 \\
                .1 & .3
		\end{array}\right]\left[\begin{array}{cc}
			.2 & .8\\
			.8 & .2
		\end{array}\right]$} (Ci);
	
	\draw[->,dashed]  (S) to node[left,font=\small]{$\left[\begin{array}{cc}
			0 & 1\\
			1 & 0
		\end{array}\right]$} (Si);
	\draw[->,dashed]  (C) to node[right,font=\small]{$\left[\begin{array}{ccc}
			1 & 0 & 0\\
			0 & 1 & 0\\
			0 & 0 & 1
		\end{array}\right]$} (Ci);
\end{tikzpicture}
		\caption{Scenario 6.}
		\label{fig:scenario6}
        \end{center}
	\end{subfigure}
\caption{Models for experimental simulations.}


\end{figure}


\subsection{Transfer of expected values}\label{app:Scenario3}

\emph{Scenario 6.} In the sixth experiment we consider the CAMAB defined in Ex. \ref{ex:counterexample1}, with three modifications: the domains are $\domain[Y]=\domain[Y']=\{0,1,2\}$, the mechanism $f_{Y}$ in the base CMAB takes the form $f_{Y} = \left[\begin{array}{cc}
			.6 & .3 \\
			.3 & .4 \\
                .1 & .3
		\end{array}\right]$, and the map $\alphamap{Y'}$ is the identity. This new model is illustrated in Fig. \ref{fig:scenario6}. The CAMAB has zero IC error. We run \ucb{} on the base CMAB, and \ucb{} and \transferexpect{} on the abstracted CMAB, for $500$ steps, while tracking the cumulative regret. We repeat the procedure $20$ times and report the mean over the repetitions. Fig. \ref{fig:simul3} (blue) shows the cumulative regret of \transferexpect{} (solid) and \ucb{} (dashed).

\emph{Scenario 7.} In the seventh experiment we consider the CAMAB used in \emph{Scenario 6}, but we redefine the domain $\domain[Y'] = \{0.4,0.5,10\}$. We run the same protocol as in \emph{Scenario 6}. Fig. \ref{fig:simul3} (green) shows the cumulative regret of \transferexpect{} (solid) and \ucb{} (dashed).
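
As an illustrative check of how the choice of $\domain[Y']$ affects the rewards, we can compute the interventional distributions implied by the matrices in Fig. \ref{fig:scenario6}. This is only a sketch under our reading of the figure: we assume that every matrix is column-stochastic, with the $j$-th column giving the distribution of the child given the $j$-th value of its parent, and we write $f_{M}$ for the matrix labelling the edge from $T$ to $M$. Under these assumptions, the product labelling the edge from $T'$ to $Y'$ evaluates to
\begin{equation*}
f_{Y} f_{M} = \left[\begin{array}{cc}
			.6 & .3 \\
			.3 & .4 \\
			.1 & .3
		\end{array}\right]\left[\begin{array}{cc}
			.2 & .8\\
			.8 & .2
		\end{array}\right] = \left[\begin{array}{cc}
			.36 & .54 \\
			.38 & .32 \\
			.26 & .14
		\end{array}\right].
\end{equation*}
With outcome values $\{0,1,2\}$, the two columns have expected values $.38 + 2 \cdot .26 = .90$ and $.32 + 2 \cdot .14 = .60$; with outcome values $\{0.4,0.5,10\}$, they have expected values $0.4 \cdot .36 + 0.5 \cdot .38 + 10 \cdot .26 = 2.934$ and $0.4 \cdot .54 + 0.5 \cdot .32 + 10 \cdot .14 = 1.776$. The outcome distributions are the same in both cases; only the values attached to the outcomes, and hence the expected rewards, change.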

\begin{figure*}
    \centering
	\begin{subfigure}{.49\textwidth}
		\centering
        \includegraphics[scale=0.45]{img/scenario2_2a.png}
        \caption{}
		\label{fig:scenario2_2a}
	\end{subfigure}
	\begin{subfigure}{.49\textwidth}
		\centering
        \includegraphics[scale=0.45]{img/scenario2_2b.png}
		\caption{}
		\label{fig:scenario2_2b}
	\end{subfigure}
	%\vspace{-2\baselineskip}
	\caption{Regret using \imitation.  Shaded areas in (a) represent the standard deviation of the regret.}\label{fig:results3}
	%\vspace{-1\baselineskip}
\end{figure*}

\begin{figure*}
    \centering
        \begin{subfigure}{.49\textwidth}
		\centering
        \includegraphics[scale=0.45]{img/scenario4_1.png}
		\caption{}
		\label{fig:scenario4_1}
	\end{subfigure}
        \begin{subfigure}{.49\textwidth}
		\centering
        \includegraphics[scale=0.45]{img/scenario4_2.png}
		\caption{}
		\label{fig:scenario4_2}
	\end{subfigure}
	%\vspace{-2\baselineskip}
	\caption{Regret for various CMAB transfer learning schemes.  Shaded areas represent the standard deviation of the regret.}\label{fig:results4}
	%\vspace{-1\baselineskip}
\end{figure*}






\subsection{Comparison with transfer learning}
We compare our approach to the \texttt{B-UCB} algorithm proposed by \citet{zhang2017transfer} for transferring information across CMABs with different but related structures. \citet{zhang2017transfer} consider three tasks; we reproduce the first two and exclude the third, as it would require modelling an abstraction going from an abstracted model to a base model, which we have not worked out yet.

\emph{Task 1}. In the first task the source model $\scm$ is defined on three binary variables: the two observable $X,Y$ and the unobserved $U$. We encode the original mechanisms in the following matrices: $f_{U} = \left[\begin{array}{c}
			.3 \\
			.7 
		\end{array}\right]$, $f_{X}(U) = \left[\begin{array}{cc}
			1 & 0\\
			0 & 1
		\end{array}\right]$, and $f_{Y}(U,X) = \left[\begin{array}{cccc}
			.9 & .5 & .1 & .7\\
			.1 & .5 & .9 & .3
		\end{array}\right]$. Allowed actions are $\dointv(X=0)$ and $\dointv(X=1)$.
The target model $\scmi$ is similarly defined on three binary variables: the two observable $X',Y'$ and the unobserved $U'$. The different mechanisms are encoded in the following matrices: $f_{U'} = \left[\begin{array}{c}
			.3 \\
			.7 
		\end{array}\right]$, $f_{X'}(U') = \left[\begin{array}{c}
			.5 \\
			.5
		\end{array}\right]$, and $f_{Y'}(U',X') = \left[\begin{array}{cccc}
			.9 & .5 & .1 & .7\\
			.1 & .5 & .9 & .3
		\end{array}\right]$. Allowed actions are $\dointv(X'=0)$ and $\dointv(X'=1)$.

We take the source and target models as the base and abstracted models, respectively, and define a CAMAB by setting up an abstraction $\abs$ with the following natural choices:
\begin{itemize}
    \item $\Rset=\{U,X,Y\}$;
    \item $\amap(U)=U', \amap(X)=X', \amap(Y)=Y'$;
    \item all the matrices $\alphamap{U'},\alphamap{X'},\alphamap{Y'}$ as identities.
\end{itemize}
The map between the intervention sets is automatically defined. Notice that we do not collect observational samples, but only interventional samples for interventions that satisfy assumption (AB1).

Given this setup, we solve the CMABs using standard \ucb{}, \texttt{B-UCB} \citep{zhang2017transfer}, \transferoptimum, \imitation{} and \transferexpect{} for $500$ steps.  We repeat the training $20$ times, and show the cumulative regret in Fig. \ref{fig:scenario4_1}.
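
To make the role of the confounder $U$ in this task concrete, the following sketch computes the expected rewards of the two actions. It relies on two assumptions about the encoding that are ours: the columns of $f_{Y}(U,X)$ are ordered as $(U,X) = (0,0),(0,1),(1,0),(1,1)$, and the second row corresponds to $Y=1$. Under these assumptions, and since $f_{U}$ and $f_{Y}$ coincide with $f_{U'}$ and $f_{Y'}$, in both models we have
\begin{align*}
\mathbb{E}[Y \mid \dointv(X=0)] &= .3 \cdot .1 + .7 \cdot .9 = .66, \\
\mathbb{E}[Y \mid \dointv(X=1)] &= .3 \cdot .5 + .7 \cdot .3 = .36.
\end{align*}
In contrast, since $f_{X}(U)$ is the identity in the source model, conditioning on $X$ observationally would give $\mathbb{E}[Y \mid X=0] = .1$ and $\mathbb{E}[Y \mid X=1] = .3$, which reverses the ranking of the two actions.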

\emph{Task 2}. In the second task the source model $\scm$ is defined on four binary variables: the three observable $Z, X,Y$ and the unobserved $U$. We encode the original mechanisms in the following matrices: $f_{U} = \left[\begin{array}{c}
			.2 \\
			.8 
		\end{array}\right]$, $f_{Z} = \left[\begin{array}{c}
			.1 \\
			.9 
		\end{array}\right]$, $f_{X}(U,Z) = \left[\begin{array}{cccc}
			1 & 0 & 0 & 1\\
			0 & 1 & 1 & 0
		\end{array}\right]$, and $f_{Y}(U,X) = \left[\begin{array}{cccc}
			.1 & .9 & .5 & .1\\
			.9 & .1 & .5 & .9
		\end{array}\right]$. Allowed actions are $\dointv(X=0)$ and $\dointv(X=1)$.
The target model $\scmi$ is similarly defined on three binary variables: the two observable $X',Y'$ and the unobserved $U'$. The different mechanisms are encoded in the following matrices: $f_{U'} = \left[\begin{array}{c}
			.2 \\
			.8 
		\end{array}\right]$, $f_{X'}(U') = \left[\begin{array}{c}
			.5 \\
			.5
		\end{array}\right]$, and $f_{Y'}(U',X') = \left[\begin{array}{cccc}
			.1 & .9 & .5 & .1\\
			.9 & .1 & .5 & .9
		\end{array}\right]$. Allowed actions are $\dointv(X'=0)$ and $\dointv(X'=1)$.

We take the source and target models as the base and abstracted models, respectively, and define a CAMAB by setting up an abstraction $\abs$ with the following natural choices:
\begin{itemize}
    \item $\Rset=\{U,X,Y\}$;
    \item $\amap(U)=U', \amap(X)=X', \amap(Y)=Y'$;
    \item all the matrices $\alphamap{U'},\alphamap{X'},\alphamap{Y'}$ as identities.
\end{itemize}
The map between the intervention sets is automatically defined. Notice that we do not rely on interventional samples, and we focus only on those interventions that comply with our assumption (AB1); therefore, we ignore possible interventions on $Z$ in the base model $\scm$.

Given this setup, we solve the CMABs using standard \ucb{}, \texttt{B-UCB} \citep{zhang2017transfer}, \transferoptimum, \imitation{} and \transferexpect{} for $500$ steps.  We repeat the training $20$ times, and show the cumulative regret in Fig. \ref{fig:scenario4_2}. Notice the absence of a line for \texttt{B-UCB}; this is due to the fact that, relying on the computed bounds, \texttt{B-UCB} can immediately identify the optimal action.
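
For reference, the expected rewards of the two actions in this second task can be sketched as follows, assuming (our reading of the encoding) that the columns of $f_{Y}(U,X)$ are ordered as $(U,X) = (0,0),(0,1),(1,0),(1,1)$ and that the second row corresponds to $Y=1$:
\begin{align*}
\mathbb{E}[Y \mid \dointv(X=0)] &= .2 \cdot .9 + .8 \cdot .5 = .58, \\
\mathbb{E}[Y \mid \dointv(X=1)] &= .2 \cdot .1 + .8 \cdot .9 = .74.
\end{align*}
Under these assumptions the optimal action is $\dointv(X=1)$, with a gap of $.16$ between the two expected rewards.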

\begin{figure}
\begin{subfigure}{.48\textwidth}
        \begin{center}
		\begin{tikzpicture}[shorten >=1pt, auto, node distance=1cm, thick, scale=0.8, every node/.style={scale=0.8}]
	\tikzstyle{node_style} = [circle,draw=black]
	\node[node_style] (X) at (0,0) {X};
	\node[node_style,dashed] (U) at (2.5,1) {U};
	\node[node_style] (Y) at (5,0) {Y};
	
	\node[node_style] (Xi) at (0,-3) {X'};
	\node[node_style,dashed] (Ui) at (2.5,-2) {U'};
	\node[node_style] (Yi) at (5,-3) {Y'};
	
	\draw[->]  (U) to node[above left,font=\small]{$\left[\begin{array}{cc}
			1 & 0\\
			0 & 1
		\end{array}\right] $} (X);
        \draw[->] (X) to (Y);
	\draw[->]  (U) to node[above right,font=\small]{$\left[\begin{array}{cccc}
			.9 & .5 & .1 & .7\\
			.1 & .5 & .9 & .3
		\end{array}\right]$} (Y);
	
        \draw[->] (Ui) to (Yi);
	\draw[->]  (Xi) to node[below,font=\small]{$\left[\begin{array}{cccc}
			.9 & .5 & .1 & .7\\
			.1 & .5 & .9 & .3
		\end{array}\right]$} (Yi);
	
	\draw[->,dashed]  (X) to node[left,font=\small]{$\left[\begin{array}{cc}
			1 & 0\\
			0 & 1
		\end{array}\right]$} (Xi);
	\draw[->,dashed]  (U) to node[right,font=\small]{$\left[\begin{array}{cc}
			1 & 0\\
			0 & 1
		\end{array}\right]$} (Ui);
        \draw[->,dashed]  (Y) to node[right,font=\small]{$\left[\begin{array}{cc}
			1 & 0\\
			0 & 1
		\end{array}\right]$} (Yi);
\end{tikzpicture}
        \caption{First task.}
		\label{fig:transfertask_1}
        \end{center}
        
	\end{subfigure}
	\begin{subfigure}{.48\textwidth}
        \begin{center}
		\begin{tikzpicture}[shorten >=1pt, auto, node distance=1cm, thick, scale=0.8, every node/.style={scale=0.8}]
	\tikzstyle{node_style} = [circle,draw=black]
	\node[node_style] (Z) at (-2,1) {Z};
        \node[node_style] (X) at (0,0) {X};
	\node[node_style,dashed] (U) at (2.5,1) {U};
	\node[node_style] (Y) at (5,0) {Y};
	
	\node[node_style] (Xi) at (0,-3) {X'};
	\node[node_style,dashed] (Ui) at (2.5,-2) {U'};
	\node[node_style] (Yi) at (5,-3) {Y'};

        \draw[->] (Z) to (X);
	\draw[->]  (U) to node[above left,font=\small]{$\left[\begin{array}{cccc}
			1 & 0 & 0 & 1\\
			0 & 1 & 1 & 0
		\end{array}\right] $} (X);
        \draw[->] (X) to (Y);
	\draw[->]  (U) to node[above right,font=\small]{$\left[\begin{array}{cccc}
			.1 & .9 & .5 & .1\\
			.9 & .1 & .5 & .9
		\end{array}\right]$} (Y);
	
        \draw[->] (Ui) to (Yi);
	\draw[->]  (Xi) to node[below,font=\small]{$\left[\begin{array}{cccc}
			.1 & .9 & .5 & .1\\
			.9 & .1 & .5 & .9
		\end{array}\right]$} (Yi);
	
	\draw[->,dashed]  (X) to node[left,font=\small]{$\left[\begin{array}{cc}
			1 & 0\\
			0 & 1
		\end{array}\right]$} (Xi);
	\draw[->,dashed]  (U) to node[right,font=\small]{$\left[\begin{array}{cc}
			1 & 0\\
			0 & 1
		\end{array}\right]$} (Ui);
        \draw[->,dashed]  (Y) to node[right,font=\small]{$\left[\begin{array}{cc}
			1 & 0\\
			0 & 1
		\end{array}\right]$} (Yi);
\end{tikzpicture}
		\caption{Second task.}
		\label{fig:transfertask_2}
        \end{center}
	\end{subfigure}
\caption{Models for the transfer tasks.}
\end{figure}

With the exception of the cases where \transferoptimum{} achieves a lower regret thanks to the preservation of the maximum via the abstraction, \bucb{} generally performs better than \imitation{} and \transferexpect{}. This is because \bucb{} provides tighter bounds than \transferexpect{}: the bounds of \bucb{} are specifically derived for the models under consideration. However, this specificity comes at the cost of the applicability of the algorithm, as discussed in the main paper: \bucb{} requires deriving custom bounds between models defined on identical variables.





\subsection{Online advertisement}
In the online advertisement simulation we run our algorithm on the CMAB defined in \citet{lu2020regret} to model an advertisement campaign at Adobe. The base CMAB $\bandit = \langle \scm,\intervset \rangle$ is defined on the endogenous variables $\envars = \{ Pr,Pu,SL,BT,ST,CK\}$ with mechanisms as follows:
    \begin{itemize}
        \item $Pr$ is the product advertised, with values in $\domain[Pr]=\{\textrm{Photoshop},\textrm{Acrobat IX Pro}, \textrm{Stock}\}$; the mechanism associated with this variable is $f_{Pr}=\left[\begin{array}{c}
			.2\\
			.2\\
                .6 
		\end{array}\right]$;
        \item $Pu$ is the purpose of the advertisement, with values in $\domain[Pu]=\{\textrm{operational},\textrm{promo},\textrm{nurture},\textrm{awareness}\}$; the mechanism associated with this variable is $f_{Pu}=\left[\begin{array}{c}
			.05\\
			.6\\
                .3\\
                .05
		\end{array}\right]$;
        \item $SL$ is the subject length of the sent message, with values in $\domain[SL]=\{\textrm{less or equal to 7 words},\textrm{more than 7 words}\}$; the mechanism associated with this variable is $f_{SL}(Pu)=\left[\begin{array}{cccc}
			.3 & .3 & .7 & .7\\
			.7 & .7 & .3 & .3
		\end{array}\right]$;
         \item $BT$ is the body template of the message, with values in $\domain[BT]=\{\textrm{template 1},\textrm{template 2}\}$; the mechanism associated with this variable is $f_{BT}(Pr,Pu)=\left[\begin{array}{cccccccccccc}
			.2&.1&.5&.8&.2&.1&.5&.8&.4&.3&.4&.5\\
			.8&.9&.5&.2&.8&.9&.5&.2&.6&.7&.6&.5
		\end{array}\right]$;
        \item $ST$ is the sending out time of the message, with values in $\domain[ST]=\{\textrm{morning},\textrm{afternoon},\textrm{evening}\}$; the mechanism associated with this variable is $f_{ST}=\left[\begin{array}{c}
			.5\\
			.2\\
                .3
		\end{array}\right]$;
        \item $CK$ is the clicking of a customer, with values in $\domain[CK]=\{0,1\}$; the mechanism associated with this variable is $f_{CK}(SL,BT,ST)=\left[\begin{array}{cccccccccccc}
			\nicefrac{3}{9} & \nicefrac{4}{9} & \nicefrac{5}{9}& \nicefrac{4}{9} & \nicefrac{5}{9} & \nicefrac{6}{9} & \nicefrac{4}{9} & \nicefrac{5}{9} & \nicefrac{6}{9} & \nicefrac{5}{9} & \nicefrac{6}{9} & \nicefrac{7}{9}\\
			\nicefrac{6}{9} & \nicefrac{5}{9} & \nicefrac{4}{9}& \nicefrac{5}{9} & \nicefrac{4}{9} & \nicefrac{3}{9} & \nicefrac{5}{9} & \nicefrac{4}{9} & \nicefrac{3}{9} & \nicefrac{4}{9} & \nicefrac{3}{9} & \nicefrac{2}{9}
		\end{array}\right]$.
  \end{itemize}
As in \citet{lu2020regret}, we take the only intervenable variables to be the product ($Pr$), the purpose ($Pu$), and the send-out time ($ST$). This CMAB is illustrated in Fig. \ref{fig:EmailCampaign} (left).

From this CMAB, we derive a simplified CMAB through the following simplifications: (i) we drop the send-out time variable ($ST$), since it is not strictly related to the time the mail is read by the customer; (ii) we reduce the number of purposes ($Pu$), both because some purposes are very unlikely (operational, awareness) and because they have an identical effect on subject length ($SL$); (iii) we reduce the number of products, merging Photoshop and Stock into a single product. We then instantiate an abstracted CMAB $\banditi = \langle \scmi,\intervseti \rangle$ defined on the endogenous variables $\envars = \{ Pr',Pu',SL',BT',CK' \}$, where:
    \begin{itemize}
        \item $Pr'$ has the same meaning as $Pr$ but has domain $\domain[Pr']=\{\textrm{Photoshop},\textrm{Acrobat IX Pro}\}$ and mechanism $f_{Pr'}=\left[\begin{array}{c}
			.8\\
                .2
		\end{array}\right]$;
        \item $Pu'$ has the same meaning as $Pu$ but has domain $\domain[Pu']=\{\textrm{promo},\textrm{nurture}\}$ and mechanism $f_{Pu'}=\left[\begin{array}{c}
			.65\\
                .35
		\end{array}\right]$;
        \item $SL'$ has the same meaning and domain as $SL$, but mechanism $f_{SL'}(Pu')=\left[\begin{array}{cc}
			.3 &  .7\\
			.7 &  .3
		\end{array}\right]$;
        \item $BT'$ has the same meaning and domain as $BT$, but mechanism $f_{BT'}(Pr',Pu')=\left[\begin{array}{cccc}
			.3&.5&.15&.65\\
			.7&.5&.85&.35
		\end{array}\right]$;
        \item $CK'$ has the same meaning and domain as $CK$, but mechanism $f_{CK'}(SL',BT')=\left[\begin{array}{cccc}
			\nicefrac{5}{9} & \nicefrac{4}{9} & \nicefrac{4}{9}& \nicefrac{3}{9} \\
			\nicefrac{4}{9} & \nicefrac{5}{9} & \nicefrac{5}{9}& \nicefrac{6}{9} 
		\end{array}\right]$.
    \end{itemize}
The remaining intervenable variables are the product ($Pr'$) and the purpose ($Pu'$). This CMAB is illustrated in Fig. \ref{fig:EmailCampaign} (right).

To fully define our CAMAB, we define an abstraction $\abs$ between the CMABs:
\begin{itemize}
    \item $\Rset = \{ Pr, Pu, SL, BT, CK \}$;
    \item $\amap(Pr) = Pr'$, $\amap(Pu) = Pu'$, $\amap(SL) = SL'$, $\amap(BT) = BT'$, $\amap(CK) = CK'$;
    \item $\alphamap{Pr'} = \left[\begin{array}{ccc}
			1 & 0 & 1\\
			0 & 1 & 0
		\end{array}\right]$, $\alphamap{Pu'} = \left[\begin{array}{cccc}
			1 & 1 & 0 & 0\\
			0 & 0 & 1 & 1
		\end{array}\right]$, $\alphamap{SL'} = \alphamap{BT'} = \alphamap{CK'} = \left[\begin{array}{cc}
			1 & 0 \\
			0 & 1
		\end{array}\right]$.
\end{itemize}
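
As a quick consistency check on this abstraction, one can push the base marginals of the two root variables through the corresponding maps. The sketch below assumes that $\alphamap{Pr'}$ and $\alphamap{Pu'}$ act on distributions by matrix multiplication, with columns indexed by the base values in the order in which the domains are listed above:
\begin{align*}
\alphamap{Pr'} f_{Pr} &= \left[\begin{array}{ccc}
			1 & 0 & 1\\
			0 & 1 & 0
		\end{array}\right]\left[\begin{array}{c}
			.2\\
			.2\\
			.6
		\end{array}\right] = \left[\begin{array}{c}
			.8\\
			.2
		\end{array}\right] = f_{Pr'}, \\
\alphamap{Pu'} f_{Pu} &= \left[\begin{array}{cccc}
			1 & 1 & 0 & 0\\
			0 & 0 & 1 & 1
		\end{array}\right]\left[\begin{array}{c}
			.05\\
			.6\\
			.3\\
			.05
		\end{array}\right] = \left[\begin{array}{c}
			.65\\
			.35
		\end{array}\right] = f_{Pu'}.
\end{align*}
Under this reading, the marginal mechanisms of $Pr'$ and $Pu'$ in the abstracted CMAB coincide with the pushforwards of the base marginals.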


\begin{figure}
    \centering
		\begin{tikzpicture}[shorten >=1pt, auto, node distance=1cm, thick, scale=0.7, every node/.style={scale=0.7}]
			\tikzstyle{node_style} = [ellipse,draw=black]
			\node[node_style] (CK) at (0,0) {Clicking};
    
                \node[node_style] (SL) at (-2.5,-1.5) {Subject Length};
                \node[node_style] (BT) at (0,-2.5) {Body Template};
			\node[node_style] (ST) at (2.5,-1.5) {Sending Time};

                \node[node_style] (Pr) at (1.5,-4) {Product};
                \node[node_style] (Pu) at (-1.5,-4) {Purpose};
			
                \draw[->]  (Pr) to (BT);
                \draw[->]  (Pu) to (SL);
                \draw[->]  (Pu) to (BT);
                \draw[->]  (SL) to (CK);
                \draw[->]  (BT) to (CK);
                \draw[->]  (ST) to (CK);


                \node[node_style] (CKi) at (10,0) {Clicking};
    
                \node[node_style] (SLi) at (8,-2) {Subject Length};
                \node[node_style] (BTi) at (12,-2) {Body Template};

                \node[node_style] (Pri) at (8,-4) {Product};
                \node[node_style] (Pui) at (12,-4) {Purpose};
			
                \draw[->]  (Pri) to (BTi);
                \draw[->]  (Pui) to (SLi);
                \draw[->]  (Pui) to (BTi);
                \draw[->]  (SLi) to (CKi);
                \draw[->]  (BTi) to (CKi);
                
		\end{tikzpicture}
		\caption{Email campaign CAMAB: base CMAB from \citet{lu2020regret} (left) and abstracted CMAB (right).}
		\label{fig:EmailCampaign}
\end{figure}


To satisfy the assumption (AB1) we define the actions in the intervention sets as follows:
\begin{itemize}
    \item $\intervset = \{ \dointv(Pu=\textrm{operational}), \dointv(Pu=\textrm{promo}), \dointv(Pu=\textrm{nurture}), \dointv(Pu=\textrm{awareness}), \dointv(Pr=\textrm{Photoshop}), \dointv(Pr=\textrm{Acrobat IX Pro}) \}$
    \item $\intervseti = \{ \dointv(Pu'=\textrm{promo}), \dointv(Pu'=\textrm{nurture}),  \dointv(Pr'=\textrm{Photoshop}), \dointv(Pr'=\textrm{Acrobat IX Pro}) \}$
\end{itemize}
with the mapping between the two action sets naturally given by $\alphamap{Pu'}, \alphamap{Pr'}$.

Given this CAMAB, we solve the abstracted CMAB running \ucb{}, \transferoptimum, \imitation, and \transferexpect{} for 1000 episodes. We repeat the procedure 20 times and report means of the simple regret as solid lines. Fig. \ref{fig:simul4} shows the regret of all our algorithms. The success of \transferoptimum{} and \imitation{} can be explained, respectively, by the preservation of the optimum and by the aggregation of the two actions with higher expected rewards. The result of \transferexpect{} is due to a moderate IC error and a low interpolation error; the resulting bounds were not tight enough to exclude actions from $\optintervseti$, but they were sufficient to transfer information and achieve a regret in line with, or lower than, that of \ucb.


\section{Further Related Work} \label{app:FurtherWork}

MABs with side information are closely related to the transfer of information between MABs. For example, \citet{zhang2017transfer} leverage causal tools to derive upper and lower bounds on the mean reward associated with each action, before running an adapted \ucb{} algorithm which leverages this information. In this sense, the approach proposed by \citet{zhang2017transfer} is similar to the algorithm of \citet{sharma2020warm}, who leverage side information in the form of bounds on the mean reward of each arm to improve regret guarantees.

Similarly, CMABs are closely connected to MABs with correlated arms \citep{Singh2024, gupta2021multi} in the sense that the SCM underlying a CMAB implicitly describes a correlation structure between actions. Likewise, the CAMAB problem we introduce shares similarities with MABs with side observations \citep{neuside, mannorside, caronstoch, buccapstoch}, wherein taking one action provides reward information about other arms. Regional MABs \citep{pmlr-v84-wang18b, wang2016} also share a close connection to the transfer setting we propose. In regional MABs, actions are partitioned into clusters indicating that their reward distributions share a common but unknown parameter. When a causal abstraction exists between two SCMs, interventions in the base model that are mapped to the same abstracted intervention are implicitly clustered together. However, the meaning of these clusters in terms of their relevance to the expected reward of a given action is far more ambiguous in our setting.

A further review of causal abstraction relating the framework of $\tau$-transformations \citep{rubenstein2017causal,beckers2018abstracting} and $\abs$ abstraction \citep{rischel2020category} is provided in \citet{zennaro2022abstraction}. A related approach to simplifying models based on clustering was proposed by \citet{anand2023causal}. Results on measuring abstraction error have been developed for both frameworks \citep{beckers2020approximate,rischel2021compositional,zennaro2023quantifying}. Finally, both abstraction frameworks have been used for learning abstractions between SCMs \citep{zennaro2023jointly,felekis2023causal,dyer2023interventionally,xia2024neural}.

\end{document}


