\newpage

\onecolumn

\title{Targeted Reduction of Causal Models\\(Supplementary Material)}
\maketitle
\appendix

\section{SUPPLEMENTAL RELATED WORK}\label{app:relwork}

Desiderata for CMR have been addressed theoretically by several works, in particular in the context of CFL \citep{chalupka2014visual,chalupka2016unsupervised}, and subsequently with the notion of \textit{exact transformations}  \citep{rubenstein2017causal} and a more strongly constrained subclass: \textit{causal abstractions} \citep{beckers2019abstracting}.
An alternative framework for composing abstractions of finite models has been proposed by \citet{rischel2021compositional}. 
However, only a few works have addressed how to build high-level representations from low-level system data alone. 
A line of works focuses on language models \citep{geiger2021causal,geiger2023causal}, where high-level variables and interpretations are readily available.
\par
We start from the opposite direction and develop a general approach to build the high-level abstraction from the ground up.
Such a construction is done in CFL \citep{chalupka2014visual,chalupka2016unsupervised}, where high-dimensional microscopic variables are turned into discrete high-level variables.
\citet{zennaro2023jointly} addressed this question in the context of finite and discrete domains, by minimizing the maximum Jensen-Shannon divergence over a finite set of perfect intervention distributions. 
In contrast with most works on CMR, our framework is fully compatible with imperfect (soft) interventions at the low level, which are more realistic and interpretable perturbations of many real-world systems than hard interventions. 
Soft interventions have been used for language model alignment \citep{geiger2023finding}, and their theoretical compatibility with the abstraction framework has been investigated by \citet{massidda2023causal}.  
Our approach aims at approximating an exact transformation, and is thus a relaxation of this setting.
\par
Other theoretical frameworks for approximate abstractions have been proposed \citep{beckers2020approximate,rischel2021compositional}.
Our work differs by providing an explicit loss, well-suited to continuous causal models, that can be optimized efficiently and provides interpretable outcomes thanks to a cause-mechanism decomposition, a lower bound, and analytic solutions. 
\par
The way we relax constraints on low-level interventions also shares similarities with the views of \citet{zhu2023meaningful}, who consider stochastic low-level do-interventions sampled according to the observational distribution. Our work instead focuses on soft interventions, on which we impose a prior distribution reflecting the relative importance that we assign to them.
Our optimization objective is averaged over this prior, such that it plays a role in the final solution.
\par
Our approach also relates to the search for optimal interventional or counterfactual manipulations to steer the output of a system to a particular value or distribution \citep{amos2018differentiable,besserve2018counterfactuals} or to best explain an observation \citep{budhathoki2022causal,von2023backtracking}.
We are in a way also selecting particular manipulations, but through the choice of dimensionality reduction $\omega$, such that they are interpretable at a high level.
\par
Finally, our approach relates to several works in causal representation learning, which have addressed identifiability of latent causal models from observational data, with \citep{liang2023causal} or without \citep{squires2023linear,von2023nonparametric} assumptions on the latent causal graph.
In contrast to those works, TCR does not assume an injective mapping from the observations to the latent variables, such that the high-level model typically loses information relative to the low-level model. 


\section{SUPPLEMENTAL BACKGROUND}
\label{app:supbkgd}

\subsection{Numerical schemes for simulations}\label{app:numschemes}
Methods for the numerical approximation of scientific models form a broad area spanning multiple fields.
We provide here a few elements based on a 1D example to justify how these models relate to SCMs. 
The Euler method \citep{euler1794institutiones} can be used to approximate a 1D ODE of the form
\[
\begin{cases}
x(t_0)=x_0\,,\\
\frac{dx}{dt}=F(x(t))\,,    
\end{cases}
\]
with $F$ a smooth real-valued function, using a discretized time grid with time step $\Delta t$.
The finite difference approximation of the derivative 
\[
x(t+\Delta t)-x(t)\approx \Delta t \cdot \frac{d x}{d t}\,,
\]
leads to the iterative numerical scheme for the approximation $\hat{x}(k\Delta t)$:
\[
\begin{cases}
\hat{x}(t_0)=x_0\,,&\\
\hat{x}(t_0+(k+1)\Delta t)= 
\hat{x}(t_0+k\Delta t)+\Delta t \cdot F(\hat{x}(t_0+k\Delta t))\,,& k\geq 0\,.
\end{cases}
\]
This scheme is called \emph{explicit} because future values depend explicitly on past ones.
In that case, one may define the $N$ low-level endogenous variables as $\Xb=[\hat{x}(t_0+\Delta t),\,\dots,\,\hat{x}(t_0+N\Delta t)]^\top$.
They can be seen as pertaining to a chain SCM with structural equations
\[
\begin{cases}
\widehat{X}_1 \coloneqq x_0 + \Delta t \cdot F(x_0)&\\
\widehat{X}_{k+1} \coloneqq \widehat{X}_k + \Delta t \cdot F(\widehat{X}_k)\,,& k\geq 1.
    \end{cases}
\]
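As a minimal illustration (the linear choice $F(x)=-ax$ with a constant $a>0$ is our own assumption, not part of the setting above), the chain SCM can be solved in closed form:
\[
\widehat{X}_{k+1} \coloneqq (1-a\Delta t)\,\widehat{X}_k \quad\Longrightarrow\quad \widehat{X}_k = (1-a\Delta t)^k\, x_0\,,
\]
which approaches the exact solution $x(t_0+k\Delta t)=x_0 e^{-ak\Delta t}$ as $\Delta t\to 0$ with $k\Delta t$ held fixed.
Each variable $\widehat{X}_{k+1}$ depends only on its parent $\widehat{X}_k$, which is the chain structure referred to above.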
Because the ODE describes deterministic dynamics, the corresponding SCM is deterministic as well, \textit{i.e.}\ exogenous variables can be taken as trivial zero constants.
However, if we turn this ODE into the following 1D SDE 
\[
\begin{cases}
X(t_0)=x_0\,,\\
dX=F(X(t))dt+\sigma_W\cdot dW\,,    
\end{cases}
\]
where $W$ is a standard Brownian motion, then the Euler--Maruyama method generalizes the previous approximation
\citep{sauer2013computational}, and leads to an updated SCM with structural equations
\[
\begin{cases}
\widehat{X}_1 \coloneqq x_0 + \Delta t \cdot F(x_0)+U_1&\\
\widehat{X}_{k+1} \coloneqq \widehat{X}_k + \Delta t \cdot F(\widehat{X}_k)+U_{k+1}\,,& k\geq 1\,,
    \end{cases}
\]
where the exogenous variables $U_k$ represent the increments of the scaled Brownian motion $\sigma_W\cdot W$ between successive time steps, and are thus jointly independent Gaussian due to fundamental properties of Brownian motion. 
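More explicitly, writing $t_k=t_0+k\Delta t$ (a notational shorthand introduced here), the exogenous variables read
\[
U_k = \sigma_W\left(W(t_k)-W(t_{k-1})\right)\sim \Ncal\left(0,\sigma_W^2\,\Delta t\right)\,,
\]
since the increments of a standard Brownian motion over disjoint intervals are independent and Gaussian, with variance equal to the interval length.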
\par
This approach generalizes to explicit numerical schemes for multivariate ODEs and SDEs where the state variable $\Xb$ is an element of $\RR^n$.
As an illustration, we can take the following class of SDE models 
\[
\begin{cases}
\Xb(t_0)=\xb_0\,,\\
d\Xb=\Fb(\Xb(t))dt+\boldsymbol{\sigma}_W\cdot d\Wb\,,    
\end{cases}
\]
with $\Fb:\RR^n\to \RR^n$, $\boldsymbol{\sigma}_W\in\RR^{n\times n}$, and $\Wb$ an $n$-dimensional standard Brownian motion. 
This leads to the scheme
\[
\begin{cases}
\widehat{\Xb}_1 \coloneqq \xb_0 + \Delta t \cdot \Fb(\xb_0)+\Ub_1&\\
\widehat{\Xb}_{k+1} \coloneqq \widehat{\Xb}_k + \Delta t \cdot \Fb(\widehat{\Xb}_k)+\Ub_{k+1}\,,& k\geq 1\,,
    \end{cases}
\]
where the $\Ub_{k}$ are now multivariate Gaussian variables, whose components may or may not be independent depending on the choice of the matrix $\boldsymbol{\sigma}_W$.
If the exogenous components are independent, the variables can be described by a standard SCM as introduced in the main text.
If the exogenous components are dependent, the variables can be described by a more general notion of SCM, allowing hidden confounding \citep{bongers2021foundations}.
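To make this condition concrete, and under the same discretization as in the 1D case, the exogenous increments satisfy
\[
\Ub_k = \boldsymbol{\sigma}_W\left(\Wb(t_0+k\Delta t)-\Wb(t_0+(k-1)\Delta t)\right)\sim\Ncal\left(\boldsymbol{0},\,\Delta t\,\boldsymbol{\sigma}_W\boldsymbol{\sigma}_W^\top\right)\,,
\]
so that the components of $\Ub_k$ are independent if and only if $\boldsymbol{\sigma}_W\boldsymbol{\sigma}_W^\top$ is diagonal, for instance when $\boldsymbol{\sigma}_W$ itself is diagonal.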
\par
Further generalizations to numerical schemes for Stochastic Partial Differential Equations (SPDEs), using finite difference approximations for partial derivatives with respect to variables other than time, are also possible \citep{millet2005implicit}.  


\subsection{Reduction of the Euler scheme for a system of point masses}\label{app:eulerpointmass}
In the context of the main text example, we assume each point mass is subject to a fluid friction force opposing its movement with fixed coefficient $\lambda$.
Masses are moreover intervened on via additional external forces $\{\fb_k\}$.
Finally, internal forces are exerted on mass $k$ by other point masses of the system, summing up to $\gb_k$.
Newton's second law applied to individual masses results in the following system of 2D vector equations
\[
m_k \frac{d\vb_{k}}{dt} = -\lambda \vb_{k}(t)+\fb_{k}(t)+ \gb_{k}(t)\,.
\]
We can approximate each equation to iteratively estimate the $x$ and $y$ components of the speed of individual point masses in the system, using a small time-step $\Delta t$, such that we get the discrete time estimates $\hat{v}_{x,k}[n]\approx v_{x,k}(n\Delta t)$   and $\hat{v}_{y,k}[n]\approx v_{y,k}(n\Delta t)$ satisfying  
\begin{align*}
m_k \cdot \hat{v}_{x,k}[n+1] &\coloneqq (1-\Delta t \lambda)\cdot m_k \cdot \hat{v}_{x,k}[n]+\Delta t \cdot f_{x,k}(n\Delta t)+\Delta t\cdot g_{x,k}(n\Delta t)\,,\\
m_k \cdot \hat{v}_{y,k}[n+1] &\coloneqq  (1-\Delta t \lambda)\cdot  m_k \cdot \hat{v}_{y,k}[n]+\Delta t \cdot f_{y,k}(n\Delta t)+\Delta t\cdot g_{y,k}(n\Delta t).
\end{align*}
Here, $f$ represents external forces, $\lambda$ is a viscous damping coefficient, and $g$ denotes internal forces.
We consider $\ib$ to be the vector of all components of external forces, and the target variable to be the final horizontal speed of the center of mass at iteration $N$.
From the physics of freely moving systems of points, it is clear that the target variable can be predicted by considering only the horizontal dynamics of the center of mass.
More precisely, we integrate the sum of external forces over the time span of the experiment, and use the last intervened time point $n_f$ to predict the final outcome of the simulation, leading to the reduction 
\begin{align*}
Z_1^{(\omega(\ib))} &= \left(\sum_k m_k \right) v_{x,G}[n_f] =
\sum_k m_k v_{x,k}[n_f]=\Delta t\sum_{n=0}^{n_f}\left(1-\Delta t \lambda\right)^{(n_f-n)}\sum_k f_{x,k}(n\Delta t)\,,\\
Y &= v_{x,G}[N] \coloneqq \left(1-\Delta t\lambda\right)^{(N-n_f)} Z_1+\sum_{n=n_f+1}^N \sum_k f_{x,k}(n\Delta t) \, .
\end{align*}
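The key step behind this reduction can be sketched as follows, under the assumption (made explicit here) that internal forces cancel pairwise by Newton's third law, \textit{i.e.}\ $\sum_k g_{x,k}(n\Delta t)=0$.
Summing the horizontal update equation over all masses then gives
\[
\sum_k m_k\, \hat{v}_{x,k}[n+1] = (1-\Delta t \lambda)\sum_k m_k\, \hat{v}_{x,k}[n]+\Delta t\sum_k f_{x,k}(n\Delta t)\,,
\]
so the total horizontal momentum follows a scalar linear recursion driven only by the sum of external forces, and the individual velocities are not needed to predict the speed of the center of mass.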
To make the notation compatible with that used in our TCR framework, we can gather all speed variables in a high-dimensional vector $\Xb$ and all external force variables in a vector $\ib$. The high-level causal model is thus generated by a linear $\tau$-map and a linear $\omega$-map for shift interventions, taking the form of the exact transformation
\begin{align*}
%z_1&=\sum m_k x_k^{(n_f)}+\Delta t \cdot \ v_{x,k}^{(n)}\\
%y_k^{(n+1)}&=y_k^{(n)}+\Delta t \cdot v_{y,k}^{(n)}\\
Z_1 &= \taub_1^\top \Xb +\omegab_1^\top \ib\,,\quad 
Y = \taub_0^\top \Xb +\omegab_0^\top \ib\coloneqq f(Z_1)\,.
\end{align*}
where the term $\omegab_0^\top \ib$ accounts for interventions that happen between discrete times $n_f+1$ and $N$ and thus affect $Y$ without being mediated by $Z_1$. In our framework, only interventions mediated by the cause, reflected in the term $\omegab_1^\top \ib$, are accounted for in the high-level model. 

\subsection{Constructive transformations}

We complete the main text definition to include the constraint on the intervention map $\omega$.
\begin{definition}
$(\tau,\omega):(\Xcal\to\Zcal,\Ical\to\Jcal)$ is a constructive $(\tau-\omega)$-transformation between models $\Lcal$ and $\Hcal$ if there exists an alignment map $\pi$ mapping each high-level endogenous variable to a subset of low-level endogenous variables such that for all $k\neq l$, $\pi(k)\cap \pi(l)=\emptyset$ and we have both 
\begin{itemize}
\item for each component $\tau_k$ of $\tau$ there exists a function $\bar{\tau}_k$ such that for all $\xb$ in $\Xcal$,
\[
\tau_k (\xb) = \bar{\tau}_k(\xb_{\pi(k)})\,;
\]
    \item for each component $\omega_k$ of $\omega$ there exists a function $\bar{\omega}_k$ such that for all $\ib$ in $\Ical$,
\[
\omega_k (\ib) = \bar{\omega}_k(\ib_{\pi(k)})\,.
\]
\end{itemize}
\end{definition}


\section{PROOF OF MAIN TEXT RESULTS}
\label{app:proofs}
\subsection{Proof of Proposition~\ref{prop:consloss}}
We first reformulate Proposition~\ref{prop:consloss} more formally as follows.

\begin{proposition}\label{prop:conslossform}
The consistency loss is positive, invariant to invertible reparametrizations as defined in Definition~\ref{def:reparam}, and vanishes if and only if the transformation is exact for almost all interventions. 
It admits the following decomposition: 
\begin{equation}
\label{eq:CMdecapp}
   \Lcal_\mathrm{cons}\!
    =\!\EE_{\ib\sim P(\ib)}\!\! \left[
    \mathrm{KL}\left(
       \widehat{P}_{\tau}^{(\ib)}\left(\Zb\right)||{P}^{(\omega(\ib))}\left(\Zb\right)
    \right)+\EE_{\zb\sim \widehat{P}_{\tau}^{(\ib)}\left(\Zb\right)}\!\!\left[
        \mathrm{KL}\left(
            \widehat{P}_{\tau}^{(\ib)}\left({Y}|\zb\right)||{P}^{(0)}\left({Y}|\zb\right)
        \right)
    \right]\right]\,,%\\
\end{equation}
and is an upper bound of the \emph{causal relevance loss} %defined as
\begin{equation}\label{eq:Lrelapp}
\Lcal_\mathrm{rel} \!=\! \EE_{\ib\sim P(\ib)}\!\left[\mathrm{KL}\left(        \widehat{P}^{(\ib)}\left(Y\right)||{P}^{(\omega(\ib))}\left(Y\right)\right)\right]\!\leq\! \Lcal_\mathrm{cons} \,. 
\end{equation}
\end{proposition}


\begin{proof}

\textbf{Positivity} of the loss follows from the positivity (non-negativity) of the KL divergence, which is preserved when taking the expectation with respect to $P(\ib)$.

\textbf{Invariance to reparameterizations.}
We assume a reparametrization $\rho$ designed according to the framework introduced in Appendix~\ref{app:reparam}. 
By invariance of the KL divergence to invertible transformations, we have equality between the KL associated to the two different reductions $(\tau,\omega)$ and $(\rho\circ \tau,\psi\circ\omega)$:
\[
\mathrm{KL}\left(
        \widehat{P}_{\tau}^{(\ib)}({Y},{\Zb})||P_{\Hcal,\gammab}^{(\omega(\ib))}(Y,\Zb)
    \right) = 
\mathrm{KL}\left(
        \tilde{\rho}_\#[\widehat{P}_{\tau}^{(\ib)}({Y},{\Zb})]||\tilde{\rho}_\#[P_{\Hcal,\gammab}^{(\omega(\ib))}(Y,\Zb)]
    \right) =\mathrm{KL}\left(
        \widehat{P}_{\rho\circ \tau}^{(\ib)}({Y},{\Zb})||P_{\Hcal,\gammab'}^{(\psi\circ\omega(\ib))}(Y,\Zb)
    \right) \, .
\]
The transformation $(\rho,\psi)$ thus leaves $\Lcal_\mathrm{cons}$ invariant. 
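
For completeness, this invariance can be checked directly in the case, assumed here for illustration, where two distributions $P$ and $Q$ have densities $p$ and $q$ and $\tilde{\rho}$ is a diffeomorphism.
The pushforward densities are then $p(\tilde{\rho}^{-1}(w))\,|\det J_{\tilde{\rho}^{-1}}(w)|$ and $q(\tilde{\rho}^{-1}(w))\,|\det J_{\tilde{\rho}^{-1}}(w)|$, so the Jacobian factor cancels in the log-ratio:
\[
\mathrm{KL}\left(\tilde{\rho}_\#P\,||\,\tilde{\rho}_\#Q\right)
=\int p(\tilde{\rho}^{-1}(w))\,|\det J_{\tilde{\rho}^{-1}}(w)|\,\log\frac{p(\tilde{\rho}^{-1}(w))}{q(\tilde{\rho}^{-1}(w))}\,dw
=\int p(v)\log\frac{p(v)}{q(v)}\,dv
=\mathrm{KL}\left(P\,||\,Q\right)\,,
\]
where the second equality is the change of variables $v=\tilde{\rho}^{-1}(w)$.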


\textbf{Cause-mechanism decomposition.}
Under our setting (see Sec.~\ref{ssec:framew}), the interventional distribution of the high-level causal model factorizes as
\[
P^{(\omega(\ib))}({Y},{\Zb})={P}^{(0)}\left({Y}|{\Zb}\right){P}^{(\omega(\ib))}\left(\Zb\right)\,.
\]
The pushforward (by reduction) of the interventional distribution of the low-level model factorizes as
\[
\widehat{P}^{(\ib)}({Y},{\Zb})=\widehat{P}^{(\ib)}\left({Y}|{\Zb}\right)\widehat{P}^{(\ib)}\left({\Zb}\right)\,,
\]
with $\widehat{P}^{(\ib)}\left({\Zb}\right)=\tau_{1,\#}[{P}^{(\ib)}(\Xb_{\pi(1)})]$ and $\widehat{P}^{(\ib)}\left({Y}|{\Zb}\right)=\frac{\tau_\#\left[P^{(\ib)}\left({\Xb}_{\pi(0)},{\Xb}_{\pi(1)}\right)\right]}{\tau_{1,\#}[{P}^{(\ib)}(\Xb_{\pi(1)})]}$ .

Thus, the KL divergence can be decomposed as
\begin{align*}
    \mathrm{KL}&\left(
        \widehat{P}^{(\ib)}({Y},{\Zb})||P^{(\omega(\ib))}({Y},{\Zb})
    \right) \\
    &= \int_{\Ycal} \int_{\Zcal} \widehat{P}^{(\ib)}({Y},{\Zb}) \log  
    \frac{
        \widehat{P}^{(\ib)}({Y},{\Zb})
        }
        {P^{(\omega(\ib))}({Y},{\Zb})
        } d\Zb dY\\
         &= \int_{\Ycal} \int_{\Zcal} \widehat{P}^{(\ib)}\left({Y}|{\Zb}\right)\widehat{P}^{(\ib)}\left({\Zb}\right) \log  
    \frac{
        \widehat{P}^{(\ib)}\left({Y}|{\Zb}\right)\widehat{P}^{(\ib)}\left({\Zb}\right)
        }
        {{P}^{(0)}\left({Y}|{\Zb}\right){P}^{(\omega(\ib))}\left(\Zb\right)} d\Zb dY\\
    &=\mathrm{KL}_Z\left(
        \widehat{P}^{(\ib)}\left(\Zb\right)||{P}^{(\omega(\ib))}\left(\Zb\right)
    \right) 
    +\EE_{z\sim \widehat{P}^{(\ib)}\left(\Zb\right)}\left[
        \mathrm{KL}_Y\left(
            \widehat{P}^{(\ib)}\left({Y}|{\Zb}=z\right)||{P}^{(0)}\left({Y}|{\Zb}=z\right)
        \right)
    \right]
\\
    &=\mathrm{KL}_Z\left(
        \widehat{P}^{(\ib)}\left({\Zb}\right)||{P}^{(\omega(\ib))}\left(\Zb\right)
    \right)
    + \mathrm{KL}_{Y,Z}\left(
        \widehat{P}^{(\ib)}\left({Y},{\Zb}\right)||{P}^{(0)}\left({Y}|{\Zb}\right)\widehat{P}^{(\ib)}\left(\Zb\right)
    \right)\,.
\end{align*}

We call the first term \textit{cause consistency loss}, as it matches the definition of a consistency loss but for cause variables only. 
The second term can be thought of as a \textit{mechanism consistency loss}, where we use the ground truth low-level cause distribution to probe the similarity of the outputs of the ``true'' (in fact, the conditional distribution) and approximate mechanism.
Our interpretability choice prevents the high-level mechanism from being intervened on, so a single stochastic map (\textit{i.e.}\ a Markov kernel) must fit at best all the sampled experimental conditionals. 


\paragraph{Lower bounding by causal relevance}

We may ask the question of causal relevance of high-level causes.
One way to quantify this is to assess whether the variations of the target due to low-level interventions are well captured by high-level interventions, which can be measured by a KL divergence on the target's marginal
\[
\Lcal_\mathrm{rel} = \EE_{\ib\sim P(\ib)}\left[\mathrm{KL}_Y\left(        \widehat{P}^{(\ib)}\left(Y\right)||{P}^{(\omega(\ib))}\left(Y\right)\right)\right] \, .
\]

Note: In the Gaussian 1D case, the formula for the causal relevance loss is
\[
\Lcal_\mathrm{rel} = \frac{1}{2}\EE_{\ib\sim P(\ib)}\left[
            \frac{({\mu}_Y+\alpha \omegab^\top \ib -\widehat{\mu}^{(\ib)}_Y)^2}{\sigma_Y^2} 
            + \frac{{\widehat{\sigma^2}}^{(\ib)}_Y}{\sigma_Y^2} -\ln{\left(\frac{{\widehat{\sigma^2}}^{(\ib)}_Y}{\sigma_Y^2}\right)} -1
        \right]\,.
\]

Interestingly, we can also decompose the joint KL divergence of the consistency loss by conditioning on $Y$ instead of $\Zb$:
\begin{align*}
    \mathrm{KL}&\left(
        \widehat{P}^{(\ib)}({Y},{\Zb})||P^{(\omega(\ib))}({Y},{\Zb})
    \right) \\
    &=\mathrm{KL}_Y\left(
        \widehat{P}^{(\ib)}\left(Y\right)||{P}^{(\omega(\ib))}\left(Y\right)
    \right) 
    +\EE_{y\sim \widehat{P}^{(\ib)}\left(Y\right)}\left[
        \mathrm{KL}_Z\left(
            \widehat{P}^{(\ib)}\left({\Zb}|{Y}=y\right)||{P}^{(\omega(\ib))}\left({\Zb}|{Y}=y\right)
        \right)
    \right]
\end{align*}
where both terms are positive by positivity of the KL divergence.
As a consequence,
\begin{align*}
    \mathrm{KL}_Y\left(
        \widehat{P}^{(\ib)}\left(Y\right)||{P}^{(\omega(\ib))}\left(Y\right)
    \right) &=\mathrm{KL}\left(
        \widehat{P}^{(\ib)}({Y},{\Zb})||P^{(\omega(\ib))}({Y},{\Zb})
    \right)
    -\EE_{y\sim \widehat{P}^{(\ib)}\left(Y\right)}\left[
        \mathrm{KL}_Z\left(
            \widehat{P}^{(\ib)}\left({\Zb}|{Y}=y\right)||{P}^{(\omega(\ib))}\left({\Zb}|{Y}=y\right)
        \right)
    \right]\\&\leq \mathrm{KL}\left(
        \widehat{P}^{(\ib)}({Y},{\Zb})||P^{(\omega(\ib))}({Y},{\Zb})
    \right)\,.
\end{align*}
Taking the expectation over $\ib\sim P(\ib)$ on both sides yields $\Lcal_\mathrm{rel}\leq\Lcal_\mathrm{cons}$, so the minimized consistency loss is an upper bound on the causal relevance loss.

\end{proof}

\subsection{Proof of Proposition~\ref{prop:analytic_solution}}
\anasol*

\begin{proof}

\textbf{Part 1: 1-cause TCR.}

\textbf{Overview.} We exploit the positive definiteness of the KL loss and its continuity with respect to $\ib$.
Since the variables are jointly Gaussian, continuity is obvious from the analytical expression of the KL for Gaussian variables and continuity of the shift operation applied to the parameters of the Gaussian.
We exploit the cause-mechanism decomposition and the lower bound $\Lcal_\mathrm{rel}\leq\Lcal_\mathrm{cons}$ to progressively identify necessary conditions on the parameters to have $\Lcal_\mathrm{cons}=0$, and finally check that those conditions are sufficient.  

%
\textbf{Preliminaries.}
Let $N_0$ denote the size of $\pi(0)$ and $N_1$ be the size of $\pi(1)$.
The SCM is assumed uniquely solvable (Definition~\ref{def:SCM}), such that $\xb=A \xb+\ub$ has a unique solution for almost all values of $\ub$.
Since $\Ub$ has full support, this implies that $I_N-A$ is invertible. 
The low-level variables then satisfy
\[
\Xb^{(\ib)} = (I_N-A)^{-1}(\Ub+\ib)
\]
where, due to the block triangular form of $A$,
\[
(I_N-A)^{-1} =
\begin{bmatrix}
(I_{N_0}-A_{00})^{-1} &  (I_{N_0}-A_{00})^{-1}A_{0\Omega}(I_{N_1}-A_{\Omega\Omega})^{-1}\\
 0 &(I_{N_1}-A_{\Omega\Omega})^{-1}
\end{bmatrix}\,.
\]
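This expression can be verified directly. Assuming, as the formula suggests, that the variables are ordered with $\pi(0)$ first and $\Omega$ second, we have
\[
I_N-A =
\begin{bmatrix}
I_{N_0}-A_{00} & -A_{0\Omega}\\
 0 & I_{N_1}-A_{\Omega\Omega}
\end{bmatrix}\,.
\]
Multiplying this matrix by the claimed inverse gives identity matrices on the diagonal blocks, while the upper-right block reads
\[
(I_{N_0}-A_{00})(I_{N_0}-A_{00})^{-1}A_{0\Omega}(I_{N_1}-A_{\Omega\Omega})^{-1}-A_{0\Omega}(I_{N_1}-A_{\Omega\Omega})^{-1}=\boldsymbol{0}\,.
\]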
In the assumed model all low-level variables are either in $\pi(0)$ or in $\Omega$.
Since the CMR is constructive we have $\pi(1)\subset \Omega$.
Without loss of generality, we can impose $\pi(1) = \Omega$ by setting the unused components of $\taub_1$ and $\omegab_1$ to zero.
For an arbitrary interventional setting $\ib$, the image of the low-level variables under $\tau_1$, which we denote $\widehat{Z}_1^{(\ib)}=\tau_1(\Xb^{(\ib)})$, then satisfies
\begin{equation}\label{eq:pushforwardz}
\widehat{Z}_1^{(\ib)} = \taub_1^\top (I_N-A)^{-1}(\Ub+\ib) = \bar{\taub}_1^\top (I_{N_1}-A_{\Omega\Omega})^{-1}(\Ub_{\Omega}+\ib_{\Omega}) \,.
\end{equation}
Moreover, because we assume also shift interventions in the high-level model, the cause $Z_1$ in this model has an interventional distribution satisfying
\[
P^{(\omega_1(\ib))}({Z}_1) = P^{(0)}\left({Z}_1-\omegab_1^\top\ib\right) \,.
\]

\textbf{Necessary conditions.} We are looking for solutions satisfying $\Lcal_\mathrm{cons}=0$.
By positivity of the KL divergence, this implies that for almost all $\ib$, the distribution of $\widehat{Z}_1^{(\ib)}$ matches the learned high-level interventional distribution $P^{(\omega_1(\ib))}({Z}_1)$ of the high-level cause $Z_1$, which is the shifted unintervened distribution given above.

\textbf{Matching unintervened distributions of $Z_1$.} If Assum.~\ref{assum:prior} holds, the prior $P(\ib_{\Omega})$ has a density with respect to the Lebesgue measure whose support includes a neighborhood of $\ib_{\Omega}=\boldsymbol{0}$.
By continuity of the KL divergence with respect to the intervention parameters, a solution making the consistency loss vanish must make the KL divergence term vanish at $\ib_{\Omega}=\boldsymbol{0}$: otherwise, the KL divergence would be strictly positive on a whole neighborhood of $\ib_{\Omega}=\boldsymbol{0}$ and $\Lcal_\mathrm{cons}$ would not vanish, contradicting our assumption. 

Alternatively, Assum.~\ref{assum:prior2} also obviously implies vanishing of the KL divergence for the unintervened setting. 

This vanishing of the KL divergence entails, again by positivity, that its two arguments, the unintervened densities, are equal, such that we get, using Eq.~(\ref{eq:pushforwardz}),
\begin{align}
P^{(0)}({Z}_1) =&\, (\tau_1)_{\#}\left[ P(\Xb) \right]= (\bar{\taub}_1^\top (I_{N_1}-A_{\Omega\Omega})^{-1})_{\#} \left[P(\Ub_{\Omega})\right]\\
=&\, \Ncal(\bar{\taub}_1^\top (I_{N_1}-A_{\Omega\Omega})^{-1}\mub_{{\Omega}},\bar{\taub}_1^\top (I_{N_1}-A_{\Omega\Omega})^{-1}\Sigma_{{\Omega}} (I_{N_1}-A_{\Omega\Omega})^{-\top}\bar{\taub}_1)\,,
\end{align}
which entails the following constraints on the variance and mean of the high-level cause
\begin{equation}\label{eq:sigz1}
\sigma_{Z_1}^2 = \bar{\taub}_1^\top (I_{N_1}-A_{\Omega\Omega})^{-1}\Sigma_{{\Omega}} (I_{N_1}-A_{\Omega\Omega})^{-\top}\bar{\taub}_1
\end{equation}
and 
\begin{equation}\label{eq:muz1}
\mu_{Z_1}= \bar{\taub}_1^\top (I_{N_1}-A_{\Omega\Omega})^{-1}\mub_{{\Omega}} \,.
\end{equation}

\textbf{Matching interventional distributions of $Z_1$.} For the same reasons, under Assum.~\ref{assum:prior}, we can further match the interventional distributions in an open set included in the interior of the support of $P(\ib)$, such that for all $\ib$ in this open set the following distributions are the same
\begin{align*}
    (\tau_1)_{\#}\left[ P^{(\ib)}(\Xb) \right] =&\, 
    \Ncal(\bar{\taub}_1^\top (I_{N_1}-A_{\Omega\Omega})^{-1}(\mub_{\Omega}+\ib_{\Omega}),\bar{\taub}_1^\top (I_{N_1}-A_{\Omega\Omega})^{-1}\Sigma_{\Omega} (I_{N_1}-A_{\Omega\Omega})^{-\top}\bar{\taub}_1)\ \mbox{and}\\
    P^{(\omega_1(\ib))}({Z}_1)=&\,\Ncal(\mu_{Z_1}+\omegab_1^\top \ib,\sigma_{Z_1}^2) 
    =\Ncal(\bar{\taub}_1^\top (I_{N_1}-A_{\Omega\Omega})^{-1}\mub_{\Omega}+\omegab_1^\top \ib,\sigma^2_{Z_1}) \,.
\end{align*}
Indeed, otherwise the KL would not vanish in a neighborhood of non-zero measure and would contradict the assumption that $\Lcal_\mathrm{cons}$ vanishes. 

This implies that for all $\ib$ in this open neighborhood
\[
\bar{\taub}_1^\top (I_{N_1}-A_{\Omega\Omega})^{-1}(\mub_{\Omega}+\ib_{\Omega})=\bar{\taub}_1^\top (I_{N_1}-A_{\Omega\Omega})^{-1}\mub_{\Omega}+\bar{\omegab}_1^\top \ib_{\Omega}\,,
\]
which simplifies to
\[
\bar{\taub}_1^\top (I_{N_1}-A_{\Omega\Omega})^{-1}\ib_{\Omega}=\bar{\omegab}_1^\top \ib_{\Omega}\,.
\]


Since this equality between two linear functions of $\ib_{\Omega}$ holds on an open set of the vector space of $\ib_{\Omega}$, these functions must be equal everywhere (the open set contains enough directions to show that the linear maps match on a basis of the space).
This holds if and only if, in addition to Eqs.~(\ref{eq:sigz1}-\ref{eq:muz1}), 
\begin{equation}\label{eq:indent1}
\bar{\omegab}_1 =  (I_{N_1}-A_{\Omega\Omega})^{-\top}\bar{\taub}_1
\end{equation}
is satisfied. 
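
The extension from an open set to the whole space can be spelled out as follows, where the point $\ib^0_{\Omega}$, the scalar $\epsilon$ and the vector $\boldsymbol{a}$ are auxiliary notation introduced here.
Let $\boldsymbol{a}^\top=\bar{\taub}_1^\top (I_{N_1}-A_{\Omega\Omega})^{-1}-\bar{\omegab}_1^\top$, pick $\ib^0_{\Omega}$ in the open set and $\epsilon>0$ small enough that $\ib^0_{\Omega}+\epsilon\boldsymbol{e}_j$ also lies in it for every canonical basis vector $\boldsymbol{e}_j$.
Then
\[
\epsilon\,\boldsymbol{a}^\top\boldsymbol{e}_j=\boldsymbol{a}^\top(\ib^0_{\Omega}+\epsilon\boldsymbol{e}_j)-\boldsymbol{a}^\top\ib^0_{\Omega}=0\,,\quad j=1,\dots,N_1\,,
\]
so that $\boldsymbol{a}=\boldsymbol{0}$, which is Eq.~(\ref{eq:indent1}) after transposition.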

Alternatively, we obtain the same result by replacing Assum.~\ref{assum:prior} by Assum.~\ref{assum:prior2}.
Indeed, the finite distribution over interventions imposes that the KL term inside the expectation must vanish for each of them (including the unintervened distribution).
As long as the finite collection of intervention vectors has rank $\#\Omega=\#\pi(1)=N_1$, we can choose a subset of $N_1$ such vectors $\{\ib^1_{\Omega},\dots,\ib_{\Omega}^{N_1}\}$ that forms a linearly independent family.
It can be used to build the matrix equality
\begin{equation}\label{eq:indent1matrix}
\bar{\taub}_1^\top (I_{N_1}-A_{\Omega\Omega})^{-1}\left[\ib^1_{\Omega},\dots,\ib_{\Omega}^{N_1}\right]=\bar{\omegab}_1^\top \left[\ib^1_{\Omega},\dots,\ib_{\Omega}^{N_1}\right]
\end{equation}
where the matrix $\left[\ib^1_{\Omega},\dots,\ib_{\Omega}^{N_1}\right]$ is invertible.
Right-multiplying Eq.~(\ref{eq:indent1matrix}) by this inverse and transposing, we obtain Eq.~(\ref{eq:indent1}) again.

\textbf{Matching distributions of $Y$.} We can move on to check the implication of consistency of the effect's conditional.
It entails, for almost all $\ib$,
\[
\widehat{P}^{(\ib)}(Y|Z)\widehat{P}^{(\ib)}(Z)={P}^{(\omega(\ib))}(Y|Z)\widehat{P}^{(\ib)}(Z)={P}^{(0)}(Y|Z)\widehat{P}^{(\ib)}(Z)\,.
\]
Introducing $T=\begin{bmatrix}
    \taub_0^\top \\\taub_1^\top
\end{bmatrix}$ the left-hand side is obtained by using
\[
(Y,Z)\sim \Ncal(T\mu_\Xb,T \Sigma_{\Xb}T^\top) \,.
\]
The right-hand side is obtained by using
\[
Y=f(Z_1)+R_0 \,.
\]
% This is a bit complex...
Fitting first only the marginals of $Y$, we obtain necessary conditions.
We have
\[
\widehat{P}^{(\ib)}(Y)={P}^{(\omega(\ib))}(Y)
\]
where for the left-hand side
\[
Y=\bar{\taub}_0^\top (I_{N_0}-A_{00})^{-1}\Ub_{\pi(0)} +\bar{\taub}_0^\top(I_{N_0}-A_{00})^{-1}A_{0\Omega}(I_{N_1}-A_{\Omega\Omega})^{-1}(\Ub_{\Omega}+\ib_{\Omega})
\]
and for the right-hand side
\[
Y\sim f_\# [P^{(\omega(\ib))}(Z_1)]*P(R_0) \,.
\]
Given the affine mechanism assumption of Eq.~(\ref{eq:linhighlevelmech}), $f(Z)=\alpha Z+\beta $ and under Assum.~\ref{assum:prior}, equality of marginal distributions entails the following equality for all $\ib_{\Omega}$ in an open neighborhood of $0$ (otherwise $\Lcal_\mathrm{rel}\leq \Lcal_\mathrm{cons}$ would not vanish)
\[
\bar{\taub}_0^\top (I_{N_0}-A_{00})^{-1}\mub_{\pi(0)} +\bar{\taub}_0^\top(I_{N_0}-A_{00})^{-1}A_{0\Omega}(I_{N_1}-A_{\Omega\Omega})^{-1}(\mub_{\Omega}+\ib_{\Omega})= \alpha \bar{\taub}_1^\top (I_{N_1}-A_{\Omega\Omega})^{-1}(\mub_{\Omega}+\ib_{\Omega}) +\beta +\mub_{R_0}
\]
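which requires (setting $\ib=\boldsymbol{0}$, and noting that the terms proportional to $\mub_{\Omega}$ cancel on both sides by the linear identity derived below)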
\[
\beta +\mub_{R_0}= \bar{\taub}_0^\top (I_{N_0}-A_{00})^{-1}\mub_{{\pi(0)}} \,.
\]
We can fix $\mub_{R_0}$ to zero to avoid redundancy of additive constants, such that
\begin{equation}\label{eq:biasconst}
\mu_{Y|Z=0}= \beta= \bar{\taub}_0^\top (I_{N_0}-A_{00})^{-1}\mub_{{\pi(0)}}
\end{equation}
and consistency of non-zero shift interventions additionally entails, for all $\ib_{\Omega}$ in the support of $P(\ib_{\Omega})$,
\[
\bar{\taub}_0^\top(I_{N_0}-A_{00})^{-1}A_{0\Omega}(I_{N_1}-A_{\Omega\Omega})^{-1}\ib_{\Omega}= \alpha \bar{\taub}_1^\top (I_{N_1}-A_{\Omega\Omega})^{-1}\ib_{\Omega} \,.
\]
Since one can always choose a linearly independent family of vectors $\ib_{\Omega}$ within the open neighborhood of zero for which this equality holds, this yields
\[
\bar{\taub}_0^\top(I_{N_0}-A_{00})^{-1}A_{0\Omega}(I_{N_1}-A_{\Omega\Omega})^{-1}= \alpha \bar{\taub}_1^\top (I_{N_1}-A_{\Omega\Omega})^{-1} \,.
\]
Then, right-multiplying by $(I_{N_1}-A_{\Omega\Omega})$ and transposing, we get
\begin{equation}\label{eq:tauconst}
A_{0\Omega}^\top(I_{N_0}-A_{00})^{-\top}\bar{\taub}_0 = \alpha \bar{\taub}_1 \,.
\end{equation}
Similarly as above, the same conclusion can be drawn if we replace Assum.~\ref{assum:prior} by Assum.~\ref{assum:prior2}.

\textbf{Sufficiency of the constraints.} 
We have derived the expressions of the TCR parameters in Eqs.~(\ref{eq:indent1},\ref{eq:biasconst},\ref{eq:tauconst}) from necessary conditions for matching the marginals of high-level variables, $P(Z_1)$ and $P(Y)$, to their corresponding pushforward distributions of the low-level variables, $(\tau_1)_\# [P(\Xb)]$ and $(\tau_0)_\# [P(\Xb)]$.
Now what remains is to check the same for conditional distributions to show those conditions are sufficient.
Indeed, this implies that the joint high-level distributions of $(Y,Z_1)$ and $(\tau_0(\Xb),\tau_1(\Xb))$ are matching.
\par
Let us first note that given that the low-level model, the $\tau$ maps and the high-level mechanisms are linear or affine, and that the low-level exogenous variables are Gaussian, the exogenous high-level variable will necessarily be Gaussian as well to satisfy the consistency constraints. 
\par
Let us now compute the covariance matrix of the low-level variables:
\[
\mathrm{cov}(\Xb)=
\begin{bmatrix}
(I_{N_0}-A_{00})^{-1}& (I_{N_0}-A_{00})^{-1}A_{0\Omega}(I_{N_1}-A_{\Omega\Omega})^{-1}\\
\boldsymbol{0}  &(I_{N_1}-A_{\Omega\Omega})^{-1} 
\end{bmatrix} \Sigma_U \begin{bmatrix}
(I_{N_0}-A_{00})^{-\top} & \boldsymbol{0}\\
 (I_{N_1}-A_{\Omega\Omega})^{-\top}A_{0\Omega}^\top(I_{N_0}-A_{00})^{-\top} &(I_{N_1}-A_{\Omega\Omega})^{-\top}
\end{bmatrix}\,.
\]
Because the exogenous covariance is block diagonal, we get
\[
=
\begin{bmatrix}
\begin{matrix}
   (I_{N_0}-A_{00})^{-1}A_{0\Omega}(I_{N_1}-A_{\Omega\Omega})^{-1} \Sigma_{\Omega}  (I_{N_1}-A_{\Omega\Omega})^{-\top}A_{0\Omega}^\top(I_{N_0}-A_{00})^{-\top}\\+
  (I_{N_0}-A_{00})^{-1} \Sigma_{\pi(0)} (I_{N_0}-A_{00})^{-\top}\end{matrix}
 &(I_{N_0}-A_{00})^{-1}A_{0\Omega}(I_{N_1}-A_{\Omega\Omega})^{-1} \Sigma_{\Omega}(I_{N_1}-A_{\Omega\Omega})^{-\top} \\ \hline
  (I_{N_1}-A_{\Omega\Omega})^{-1}\Sigma_{\Omega}(I_{N_1}-A_{\Omega\Omega})^{-\top}A_{0\Omega}^\top(I_{N_0}-A_{00})^{-\top} &(I_{N_1}-A_{\Omega\Omega})^{-1}\Sigma_{\Omega}(I_{N_1}-A_{\Omega\Omega})^{-\top} 
  \end{bmatrix} \,.
\]
Then, if we denote $\widehat{Y}=\tau_0(\Xb)$
\[
\mathrm{cov}((\widehat{Y},\widehat{Z}_1))=T\, \mathrm{cov}(\Xb)\, T^{\top}
\]
and we can derive the conditional mean and variance of $\widehat{Y}$ given $\widehat{Z}_1=\hat{z}_1$
\[
\mu_{\widehat{Y}|\hat{z}_1}= \mu_{\widehat{Y}}+ \bar{\taub}_0^\top (I_{N_0}-A_{00})^{-1}A_{0\Omega}(I_{N_1}-A_{\Omega\Omega})^{-1} \Sigma_{\Omega}(I_{N_1}-A_{\Omega\Omega})^{-\top} \bar{\taub}_1
\left(\bar{\taub}_1^\top (I_{N_1}-A_{\Omega\Omega})^{-1}\Sigma_{\Omega}(I_{N_1}-A_{\Omega\Omega})^{-\top}\bar{\taub}_1 \right)^{-1}\left(\hat{z}_1-\mu_{Z_1}\right) \, .
\]
Thus, using Eq.~(\ref{eq:tauconst}),
\[
\mu_{\widehat{Y}|\hat{z}_1}= \mu_{\widehat{Y}}+ \alpha \left(\hat{z}_1-\mu_{Z_1}\right)=\bar{\taub}_0^\top (I_{N_0}-A_{00})^{-1}\mub_{\pi(0)} +\bar{\taub}_0^\top(I_{N_0}-A_{00})^{-1}A_{0\Omega}(I_{N_1}-A_{\Omega\Omega})^{-1}\mub_{\Omega}+ \alpha \hat{z}_1 -\alpha\bar{\taub}_1^\top  (I_{N_1}-A_{\Omega\Omega})^{-1}\mub_{\Omega}\,,
\]
which further simplifies, again by Eq.~(\ref{eq:tauconst}), to
\[
\mu_{\widehat{Y}|\hat{z}_1}= \mu_{\widehat{Y}}+ \alpha \left(\hat{z}_1-\mu_{Z_1}\right)=\bar{\taub}_0^\top (I_{N_0}-A_{00})^{-1}\mub_{\pi(0)}+ \alpha \hat{z}_1 \,.
\]


Moreover,
\begin{multline*}
\mbox{var}({\widehat{Y}|\hat{z}_1})= \sigma_{\widehat{Y}}^2- \bar{\taub}_0^\top (I_{N_0}-A_{00})^{-1}A_{0\Omega}(I_{N_1}-A_{\Omega\Omega})^{-1} \Sigma_{\Omega}(I_{N_1}-A_{\Omega\Omega})^{-\top} \bar{\taub}_1
\left(\bar{\taub}_1^\top (I_{N_1}-A_{\Omega\Omega})^{-1}\Sigma_{\Omega}(I_{N_1}-A_{\Omega\Omega})^{-\top}\bar{\taub}_1 \right)^{-1}\\
\bar{\taub}_1^\top(I_{N_1}-A_{\Omega\Omega})^{-1}\Sigma_{\Omega}(I_{N_1}-A_{\Omega\Omega})^{-\top}A_{0\Omega}^\top(I_{N_0}-A_{00})^{-\top}\bar{\taub}_0
\end{multline*}
again, using the above equation this leads to the simplification
\begin{multline*}
\mbox{var}({\widehat{Y}|\hat{z}_1})= \sigma_{\widehat{Y}}^2- \alpha
\bar{\taub}_1^\top(I_{N_1}-A_{\Omega\Omega})^{-1}\Sigma_{\Omega}(I_{N_1}-A_{\Omega\Omega})^{-\top}A_{0\Omega}^\top(I_{N_0}-A_{00})^{-\top}\bar{\taub}_0\\
= \sigma_{\widehat{Y}}^2- \alpha
\bar{\taub}_1^\top(I_{N_1}-A_{\Omega\Omega})^{-1}\Sigma_{\Omega}(I_{N_1}-A_{\Omega\Omega})^{-\top}\bar{\taub}_1\alpha\\
=  \bar{\taub}_0^\top (I_{N_0}-A_{00})^{-1} \Sigma_{\pi(0)} (I_{N_0}-A_{00})^{-\top}\bar{\taub}_0 \,.
\end{multline*}

For the high-level distribution we get
\[
P(Y|z)= \Ncal( \alpha z +\beta,\sigma_{Y|Z}^2)
\]
where we can identify all parameters, including the mean and variance of the Gaussian exogenous variable $R_0$, with the above Eqs~(\ref{eq:indent1},\ref{eq:biasconst},\ref{eq:tauconst}).


\textbf{Part 2: $n$-causes case ($n>1$)} 

We now assume that there exists a solution ${(\tau_k,\omega_k,\pi)}_{k=1..n}$ such that the loss vanishes, but that no such solution exists for $n+1$ causes.

\textbf{Properties of exact $n$-cause solutions}
Such a solution can be linked to the 1-cause solution, which is guaranteed to exist according to our set of assumptions.
Indeed, the existence of the $n$-cause solution implies that the pushforward $\widehat{P}$ of the interventional distribution of the low-level causal model by $\tau$ satisfies
\[
\widehat{P}^{(\ib)}(Y|\Zb=\zb) \sim \Ncal\left(\sum_k \alpha_k z_k+\beta, \sigma_{Y|Z}^2\right)
\]
and
\[
\widehat{P}^{(\omega(\ib))}(\Zb) \sim \prod_k \widehat{P}^{(0)}(Z_k-\omega_k(\ib_k))
\]
If we define the aggregate cause $\tilde{Z}=\sum_k \alpha_k z_k=\sum_k \alpha_k\tau_k(x)$, then we can rewrite the above model as
\[
\widehat{P}^{(\ib)}(Y|\tilde{Z}=\tilde{z}) \sim \Ncal\left(\tilde{z}+\beta, \sigma_{Y|Z}^2\right)
\]
and
\[
\widehat{P}^{(\omega(\ib))}(\tilde{Z}) \sim \widehat{P}^{(0)}\Bigl(\tilde{Z}-\sum_k\alpha_k\omega_k(\ib_k)\Bigr)
\]
which implies that concatenating the $\tau_k$ with multiplicative coefficient $\alpha_k$ leads to a valid 1-cause TCR, and must thus match the expressions we have found for it, up to a multiplicative constant.
\par
Moreover, the interventional consistency of the $n$ causes, which do not influence each other according to the assumed high-level causal graph, entails that any low-level intervention $\ib$ affects the high-level variable $Z_k$ only through its components in $\pi(k)$. 
\par
We define $A_{1..n,1..n}$ by reordering the indices of $\Omega$ according to the assumed alignment $\pi$.
Consistency then implies (using the above proof of the 1-cause solution)
\begin{equation}\label{eq:ncausematconst}
    \begin{bmatrix}
    \bar{\taub}_{1}^\top ,&\dots,&0\\
    \vdots,&\ddots,&\vdots\\
    0,&\dots,&\bar{\taub}_{n}^\top
\end{bmatrix} (I_{\sum_k N_k}-A_{1..n,1..n})^{-1}\ib_{\Omega}=\begin{bmatrix}
    \omegab_{1}^\top \ib_{\pi_1}\\\vdots\\\omegab_{n}^\top \ib_{\pi_n}
\end{bmatrix}\,.
\end{equation}
This implies that 
\begin{equation}\label{eq:ncausevanish}
    \bar{\taub}_{k}^\top (I_{\sum_k N_k}-A_{1..n,1..n})^{-1}_{kj}=0\,\mbox{ for all }j\in \pi(l),l\neq k\,,
\end{equation}
where $(.)_{kj}$ indicates the matrix block corresponding to indices in $\pi(k)\times \pi(j)$.
Because the non-vanishing coefficients of $\bar{\taub}_k$ reflect the influences along the causal pathways from nodes of $\pi(k)$ to $Y$, the above entails that $(I_{\sum_k N_k}-A_{1..n,1..n})^{-1}_{kj}$ must vanish on the support of $\bar{\taub}_k$.
Indeed, otherwise Eq.~(\ref{eq:ncausevanish}) would indicate that causal pathways from nodes in $\pi(k)$ to $\pi(0)$ cancel each other, which is forbidden by our assumptions. 
\par
We thus deduce that any off-diagonal block element of $(I_{\sum_k N_k}-A_{1..n,1..n})^{-1}$ whose row component belongs to the support of any $\tau_k$ and whose column component belongs to the support of any $\omega_k$ must vanish. Indeed, otherwise the causes would influence each other.
In essence, this means that any node influencing the target must not influence, through any causal pathway, a node of another group that also influences the target.

\textbf{Identifiability of the $n$-cause solution}
We assume $n=n_{max}$ and consider a second $n$-cause solution, which we denote: ${(\tau_k',\omega_k',\pi')}_{k=1..n}$. 

\textit{If} $\pi'=\pi$ up to a permutation of the order of the causes and removal of low-level variables that belong to the support of neither any $\tau$ nor any $\omega$,
then identifiability of the corresponding 1-cause solution implies that each $\tau_k'$ is identified with the corresponding $\tau_k$ up to a multiplicative constant, because they both match the components of the 1-cause $\tau$ on their (identical) support $\pi(k)$.
The same goes for $\omega_k'$ and $\omega_k$.
This corresponds to the conclusion of the Proposition. 

\textit{Otherwise}, $\pi'\neq\pi$ even up to a permutation of the order of the causes and removal of low-level variables that belong to the support of neither any $\tau$ nor any $\omega$.
Then the supports of the $\omega$ and $\tau$ maps of one cause of one solution must overlap with those of two different causes of the other solution.
Without loss of generality, because of the block-diagonal structure of $(I_{\sum_k N_k}-A_{1..n,1..n})^{-1}$ entailed by Eqs.~(\ref{eq:ncausematconst}-\ref{eq:ncausevanish}) for both solutions, this overlap implies that $\pi(k)$ for at least one $k$ can be further partitioned and reordered into two subgroups, such that the corresponding diagonal block of $(I_{\sum_k N_k}-A_{1..n,1..n})^{-1}$ can be turned into a block diagonal submatrix. 
This can be used to build a new alignment $\pi''$ for $n+1$ causes, and its associated $\tau$ and $\omega$ maps, such that Eq.~(\ref{eq:ncausematconst}) is again satisfied, leading to interventional consistency of the $(n+1)$ causes. Moreover, because the exogenous variables in $\Omega$ are assumed independent, the block diagonal structure of the newly defined matrix $(I_{\sum_k N_k}-A_{1..n+1,1..n+1})^{-1}$ entails that the covariance of the $n+1$ high-level cause variables will be diagonal, ensuring mutual independence between high-level causes.
This procedure exhibits the existence of an $(n+1)$-cause TCR, contradicting the original assumption that $n=n_{max}$.
This case is thus excluded.
\end{proof}



\section{ADDITIONAL THEORY}\label{app:suptheory}

\subsection{Reparametrizations  of reductions}\label{app:reparam}
In order to study invariance properties of TCR, we define transformations compatible with a class of reductions.  Let $\rho:\Zcal_1 \times \dots \times \Zcal_n \to \Zcal_1 \times \dots \times \Zcal_n $ be a continuous invertible transformation of the $n$-dimensional high-level cause vector. Then the transformation
\[
\tilde{\rho}:
\begin{bmatrix}
    Y\\
    \Zb
\end{bmatrix}
\mapsto
\begin{bmatrix}
        Y\\
    \rho(\Zb)
\end{bmatrix}
\]
is also continuous invertible. Among this class of transformations, we define an invertible reparametrization of a TCR as follows.

\begin{definition}\label{def:reparam}
An invertible reparametrization of a reduction for the class $\Tcal$ of $\tau$-maps and the class $\{\Hcal_{\gammab} \}_{\gammab\in\Gamma}$ satisfies the following properties. 
\begin{itemize}
    \item it is \emph{compatible} with the class of $\tau$-maps as follows: for any map $\tau\in\Tcal$, we have $\tilde{\rho}\circ \tau\in \Tcal$,
    \item it is \emph{compatible} with the high-level model class $\{\Hcal_{\gammab}\}$ as follows: for any 
model parameter $\gammab$, the unintervened and intervened distributions $P_{\Hcal,\gammab}(Y,\Zb)$  are such that there exist a parameter $\gammab'$ and a map between high-level interventions $\psi:\Jcal\to \Jcal$ such that the joint distribution of the transformed variables $(Y,\rho(\Zb))$ is compatible with the unintervened and intervened distributions of $\Hcal_{\gammab'}$, in the sense that
\[
\tilde{\rho}_\#[P_{\Hcal,\gammab}^{(\jb)}(Y,\Zb)]=P_{\Hcal,\gammab'}^{(\psi(\jb))}(Y,\Zb)\,.
\]
\end{itemize}    
\end{definition}
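
As an illustration of Definition~\ref{def:reparam}, consider the following sketch. It assumes, beyond what the definition states, that $\Tcal$ is the class of linear maps and that the high-level models are linear Gaussian with shift interventions, $Z_k\coloneqq R_k+j_k$ and $Y\coloneqq \sum_k \alpha_k Z_k+R_0$. The non-zero constants $c_k$ are hypothetical and introduced only for this example. The componentwise rescaling $\rho(\Zb)=(c_1Z_1,\dots,c_nZ_n)$ keeps $\tilde{\rho}\circ\tau$ linear, and the transformed variables satisfy
\[
c_kZ_k = c_kR_k + c_k j_k\,,\qquad Y=\sum_k \frac{\alpha_k}{c_k}\,(c_kZ_k)+R_0\,.
\]
Under these assumptions both compatibility conditions hold, with $\gammab'$ obtained by replacing $\alpha_k$ by $\alpha_k/c_k$ and $R_k$ by $c_kR_k$, and with $\psi(\jb)=(c_1j_1,\dots,c_nj_n)$.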

\subsection{The case of a single target low-level variable}\label{app:singtar}
Whenever $\pi(0)$ is a singleton, $\tau_0$ is univariate and the target $Y$ essentially corresponds (up to trivial rescaling) to a single low-level variable.
We elaborate on the interpretation of Proposition~\ref{prop:analytic_solution} in this context.

Let us set $\bar{\taub}_0=1$ and fix the target index such that $\pi(0)=\{N\}$ without loss of generality.
Then the DAG constraints entail $A_{00}=0$ and the structural equations take the form
\begin{align}\label{eq:subsystsingletarget}
\Xb_{\pi(1)}&\coloneqq A_{11} \Xb_{\pi(1)}+\Ub_{\pi(1)} +\ib \,,\quad U_k \sim \Ncal(\mu_k,\sigma_k^2)    \\
Y &\coloneqq \ab_{01}^\top \Xb_{\pi(1)}+U_N
\end{align}
where $\ab_{01}$ is a column vector of coefficients of the low-level mechanism linking the target $Y$ to its causes in $\pi(1)$. 
Then the unique linear 1D TCR, up to a multiplicative constant, making the consistency loss vanish is given by
\begin{align}
\bar{\taub}_1 =&\, \ab_{01} \label{eqn:analyitcal_tauappsingelton}\\
\text{and} \quad \bar{\omegab}_1 =&\,  (I_{N-1}-A_{11})^{-\top}\bar{\taub}_1 =(I_{N-1}-A_{11})^{-\top}\ab_{01}\,. \label{eqn:analyitcal_omegaappsingleton}
\end{align}

This solution is easily interpretable: $\bar{\taub}_1$ identifies the ground truth mechanism linking $\Xb_{\pi(1)}$ to the target, while $\bar{\omegab}_1$ traces the contribution of interventions on each endogenous variable to the target.
Indeed, this contribution is given by the ``reduced form'' map between exogenous values and endogenous values (see proof of Proposition~\ref{prop:analytic_solution} for more insights)
\[
\ib \mapsto (I_{N-1}-A_{11})^{-1}\ib\,,
\]
and by composing this mapping with mechanism $\ab_{01}$ we get the (shift) influence of interventions on the target
\[
\ib \mapsto \ab_{01}^{\top}(I_{N-1}-A_{11})^{-1}\ib= \bar{\omegab}_1^\top \ib\,.
\]

The mismatch between $\bar{\omegab}_1$ and $\bar{\taub}_1$ is due to the internal causal structure of the submodel described by Eq.~\eqref{eq:subsystsingletarget}. Indeed, if there are no causal links within this subsystem, $A_{11}$ is the zero matrix and
\[
 \bar{\omegab}_1 =  (I_{N-1})^{-\top}\bar{\taub}_1 =\bar{\taub}_1 =\ab_{01}\,,
\]
otherwise, the two maps will be different. The discrepancy between the vectors thus reflects the fact that the causal explanation links high-level endogenous variables and interventions on them by potentially complex low-level interactions that do not necessarily have a simple high-level interpretation. This justifies regularizing the consistency loss with a homogeneity loss in order to focus on explanations that exhibit congruent $\tau$ and $\omega$ maps.
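
To make this mismatch concrete, here is a minimal worked example. The three-variable graph and the coefficients $a$, $b_1$, $b_2$ are hypothetical and chosen only for illustration. Take $N=3$ with $X_1\coloneqq U_1+i_1$, $X_2\coloneqq aX_1+U_2+i_2$ and $Y\coloneqq b_1X_1+b_2X_2+U_3$, so that
\[
A_{11}=\begin{bmatrix}0&0\\a&0\end{bmatrix}\,,\quad
\ab_{01}=\begin{bmatrix}b_1\\b_2\end{bmatrix}\,,\quad
(I_2-A_{11})^{-\top}=\begin{bmatrix}1&a\\0&1\end{bmatrix}\,.
\]
Equations~\eqref{eqn:analyitcal_tauappsingelton} and \eqref{eqn:analyitcal_omegaappsingleton} then give
\[
\bar{\taub}_1=\begin{bmatrix}b_1\\b_2\end{bmatrix}\,,\qquad
\bar{\omegab}_1=\begin{bmatrix}b_1+ab_2\\b_2\end{bmatrix}\,.
\]
The first entry of $\bar{\omegab}_1$ collects both the direct effect $b_1$ of $X_1$ on $Y$ and the indirect effect $ab_2$ mediated by $X_2$, and the two vectors coincide exactly when $ab_2=0$.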

\subsection{The case of linear chain SCMs}\label{app:linchain}

In the case of a chain SCM 
\[
\Xb_1\to \dots \to \Xb_{N-1} \to \Xb_{N}=Y
\]
the above linear setting gets the additional constraints (using a causal ordering of the variables) that the target's mechanism is sparse
\[
\ab_{01}^{\top}= [0,\,\dots,\,0,\,a_{N}]
\]
and the structure matrix of $\Xb_{\pi(1)}$ is subdiagonal 
\[
A_{11}=
\begin{bmatrix}
    0& 0 &\dots & 0& 0\\
    a_2&0 &\dots &0& 0\\
    0& a_3 &\dots &0& 0\\
    \vdots& \vdots &\ddots &\vdots& \vdots\\
    0&0 &\dots & a_{N-1} & 0\\
\end{bmatrix}
\]
and as a consequence, the solution reads
\begin{align}
\bar{\taub}_1 =&\,[0,\,\dots,\,0,\,a_{N}]^\top\label{eqn:analyitcal_tauappchain}\\
\text{and} \quad \bar{\omegab}_1 =&\,  (I_{N-1}-A_{11})^{-\top}\bar{\taub}_1 =
\begin{bmatrix}
a_{2}a_{3}\cdots a_{N} \\
\vdots\\
a_{N-1} a_{N}\\
 a_{N}    
\end{bmatrix}
\,. \label{eqn:analyitcal_omegaappchain}
\end{align}
This solution is in line with our experimental results:
\begin{itemize}
    \item $\bar{\taub}_1$ has all its weight on the parent of the target.
    \item $\bar{\omegab}_1$ has a non-sparse distribution over the chain, decaying in the upstream direction. This reflects that the structure coefficients of $A_{11}$ are selected with absolute value smaller than one, such that the influence of ancestor nodes on the target decays with their distance to it on the graph. 
\end{itemize}
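
As a hedged numerical illustration (the chain length and the coefficient value $0.5$ are arbitrary choices, not taken from the experiments), consider $N=4$ with $X_2\coloneqq 0.5\,X_1+U_2+i_2$, $X_3\coloneqq 0.5\,X_2+U_3+i_3$ and $Y\coloneqq 0.5\,X_3+U_4$. The general single-target solution then gives
\[
\bar{\taub}_1=\begin{bmatrix}0\\0\\0.5\end{bmatrix}\,,\qquad
\bar{\omegab}_1=(I_3-A_{11})^{-\top}\bar{\taub}_1=
\begin{bmatrix}1&0.5&0.25\\0&1&0.5\\0&0&1\end{bmatrix}
\begin{bmatrix}0\\0\\0.5\end{bmatrix}
=\begin{bmatrix}0.125\\0.25\\0.5\end{bmatrix}\,,
\]
so that the weights of $\bar{\omegab}_1$ halve at each step away from the target, and the cosine similarity between the two vectors is $4/\sqrt{21}\approx 0.87$.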

Transposing the chain example to the case of Proposition~\ref{prop:anasol2}, we can take the case where the direct parent $X_{N-1}$ of the target is left unintervened. In such a case, $\bar{\taub}_1$ may put its weight on both $X_{N-1}$ and its direct parent $X_{N-2}$: Proposition~\ref{prop:analytic_solution} provides two example solutions for different choices of $\pi(1)$, including or excluding $X_{N-1}$. In the most extreme case of dissimilarity between $\taub_1$ and $\omegab_1$, the solution including $X_{N-1}$ in $\pi(1)$ puts all of $\taub_1$'s weight on $X_{N-1}$, while $\omegab_1$ has no weight on it (because it is unintervened). As a consequence, $\omegab_1$ and $\taub_1$ are orthogonal and the associated homogeneity loss vanishes. In contrast, the unique solution excluding $X_{N-1}$ from $\pi(1)$ has a larger cosine similarity and will thus be preferred by the homogeneity-regularized loss.

\subsection{Loss of identifiability through unintervened variables}\label{app:nointernoident}

\begin{figure*}[htb]
\centering
    \includegraphics[width=0.5\linewidth]{figures/chainv2.pdf}
	\caption{\small
	\textbf{1-cause TCR solutions on a chain graph.}
        Arrows indicate non-zero coefficients of each map.
        (a) Unique solution $\taub_1$ when interventions are performed on all nodes except the target.
        (b) Two solutions $\taub_1$ and $\taub'_1$ when only the first two nodes are intervened on.
        }
    \label{fig:chaintheory}
\end{figure*}

\begin{restatable}{proposition}{anasoltwo}\label{prop:anasol2}
Consider the setting of Prop.~\ref{prop:analytic_solution} with the exception that $\Omega \subsetneqq \pi(1)$ such that there is now a non-empty subset $S= \pi(1)\setminus \Omega$, such that $\Xb_{\Omega} \to \Xb_{S} \to \Xb_{\pi(0)}$. 
Then there exist at least two different linear 1D TCRs such that $\Lcal_\mathrm{cons}=0$.
\end{restatable}
This result can also be illustrated with a chain graph, as shown in Fig.~\ref{fig:chaintheory}(b). 
If the parent node $X_3$ of $Y=X_4$ is unintervened, then one may choose either $Z_1=X_2$ or $Z_1=X_3$ (matching the solution of Fig.~\ref{fig:chaintheory}(a)) to minimize $\Lcal_\mathrm{cons}$. 
This is because both variables are equivalently mediating all performed interventions to $X_4$.
Note that each choice has its own benefit: $Z_1=X_3$, as a direct parent of $Y$, is a better statistical predictor of the value of $Y$.
However, if we focus on causal interpretability of the high-level representation, $Z_1=X_2$ is preferable because it is one of the variables intervened on at the low-level as enforced by the prior $P(\ib)$, and such that it will be associated to a non-zero weight in $\omegab_1$ for any solution satisfying $\Lcal_\mathrm{cons}=0$. 
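
The following sketch spells out why both choices can be consistent on the four-node chain. The coefficients $b_2$, $b_3$, $b_4$ are hypothetical names introduced for this example, and only $X_1$ and $X_2$ are assumed to receive shift interventions $i_1$ and $i_2$. With $X_2\coloneqq b_2X_1+U_2+i_2$, $X_3\coloneqq b_3X_2+U_3$ and $Y\coloneqq b_4X_3+U_4$, the two candidate high-level models read
\begin{align*}
Z_1=X_3:&\quad Y=b_4Z_1+U_4\,, &&\text{shift of } Z_1 \text{ under } \ib:\ b_3(i_2+b_2i_1)\,,\\
Z_1=X_2:&\quad Y=b_3b_4Z_1+(b_4U_3+U_4)\,, &&\text{shift of } Z_1 \text{ under } \ib:\ i_2+b_2i_1\,.
\end{align*}
In both cases the residual noise in the equation for $Y$ is independent of $Z_1$ and unaffected by $\ib$, and the intervention acts on $Z_1$ as a shift that is linear in $\ib$, which is what the linear Gaussian high-level model requires.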
\par


\begin{proof}
The low-level model follows the SCM below, with $P(\ib)$ non-trivial
\[
\Xb\coloneqq A \Xb+\Ub +\ib \,,\quad U_k \sim \Ncal(\mu_k,\sigma_k^2)
\]
such that $\Xb$ and $A$ take the block forms
\[
\Xb = \begin{bmatrix}
\Xb_{\pi(0)}\\\Xb_S\\ \Xb_{\Omega}
\end{bmatrix}\,,\quad 
A=
\left[
\begin{matrix}
A_{00}\, & A_{0S}&\boldsymbol{0}\\
\boldsymbol{0} & A_{SS} & A_{S\Omega}\\
\boldsymbol{0}\,& \boldsymbol{0}& A_{\Omega \Omega}
\end{matrix}
\right]\,,
\]
with $\pi(0)$ of size $N_0$, and $\pi(1)$ of size $N_1=N-N_0$ and $S$ of size $s$.
Then we know from Proposition~\ref{prop:analytic_solution} that there is already a valid solution using $\pi$ as alignment. The only difference is that variables in $S$ are unintervened, which does not affect the ability of the solution to achieve $\Lcal_\mathrm{cons}=0$.
That solution would be compatible with interventions on $S$, but since $S$ is unintervened, we do not have uniqueness guarantees for this choice of $\pi$. 

Alternatively, if we choose $\pi'(0)=\pi(0)\cup S$ and $\pi'(1)=\pi(1)\setminus S=\Omega$, then we can again apply Proposition~\ref{prop:analytic_solution} and see that it provides a different solution with this alignment, which is compatible with the given problem (constructive transformation with constraint on the mapping $\tau_0$). 
Importantly, the key indeterminacy is for the map $\tau_1$, which will either put all its weight on elements in $S$ (direct parents of $\pi(0)$), or alternatively, put all its weights on elements in $\Omega$.
There is an additional, but trivial, indeterminacy for the map $\omega_1$: indeed, since $X_{S}$ is unintervened (as part of $\pi(0)$), the weights in $\omegab_1$ associated to these coefficients may take arbitrary values (since their associated component in $\ib$ remains zero). We do not consider these trivial indeterminacies (which do not affect the mapping $\omega_1$ on its domain, i.e. the support of the prior $P(\ib)$) by forcing the weights of $\omegab_1$ associated to unintervened variables to zero. 
\end{proof}


\subsection{Connection to causal abstractions}

\begin{proposition}\label{prop:abs_solution}
Assume
 the low-level SCM follows 
\[
\Xb\coloneqq A \Xb+\Ub +\ib \,,\quad U_k \sim \Ncal(\mu_k,\sigma_k^2)\,,\sigma_k^2>0\,,\,\,\ib\sim P(\ib)
\]
such that $\Xb$ and $A$  take the block forms
\[
\Xb = \begin{bmatrix}
\Xb_{\pi(0)}\\ \Xb_{\pi(1)}
\end{bmatrix}\,,\quad 
A=
\left[
\begin{matrix}
A_{00}\, & A_{01}\\
\boldsymbol{0}\, & A_{1 1}
\end{matrix}
\right]\,.
\]
Given an arbitrary choice of linear scalar target of the form $Y=\tau_0^\top \Xb=\bar{\tau}_0^\top \Xb_{\pi(0)}$, under the conditions of Proposition~\ref{prop:analytic_solution},  
the unique linear 1-cause TCR (up to a multiplicative constant) is associated to a 1-cause constructive abstraction
 given by 
\begin{align}
\bar{\taub}_1 =& \, A_{01}^\top (I_{\#\pi(0)}-A_{00})^{-\top}\bar{\taub}_0 \label{eqn:analyitcal_tau_abs}\\
 \quad \bar{\omegab}_1 =& \,  (I_{\#\pi(1)}-A_{1 1})^{-\top}\bar{\taub}_1\,,\\
 \bar{\taub}_{U,0}^\top=& \bar{\taub}_0^\top(I_{N_0}-A_{00})^{-1}\,,\\\text{and }
 \bar{\taub}_{U,1}^\top=& \bar{\taub}_1^\top (I_{N_1}-A_{11})^{-1}\,. \label{eqn:analyitcal_omega_abs}
\end{align}
\end{proposition}

\begin{proof}
    To have a valid constructive causal abstraction, we need to verify the existence of an additional constructive map $\tau_U$ for exogenous variables such that, for all realizations $\ub$ of $\Ub$,
    \[
    \tau (\Lcal^{(\ib)}(\ub))=\Hcal^{(\omega(\ib))}(\tau_U(\ub))
    \]
where $\Lcal^{(\ib)}(.)$ and $\Hcal^{(\jb)}(.)$ denote the mappings from exogenous to endogenous variables for the low- and high-level intervened models, respectively. 

Using
    \[\Lcal^{(\ib)}(\Ub)=
\Xb^{(\ib)} = (I_N-A)^{-1}(\Ub+\ib)
\]
we get
    \[
\Xb_{\pi(1)}^{(\ib)} = (I_{N_1}-A_{11})^{-1}(\Ub_{\pi(1)}+\ib_{\pi(1)})
    \]
    and 
    \[
\Xb_{\pi(0)}^{(\ib)} = (I_{N_0}-A_{00})^{-1}\Ub_{\pi(0)}+ (I_{N_0}-A_{00})^{-1}A_{01}\Xb_{\pi(1)}^{(\ib)}
    %=(I_{N_0}-A_{00})^{-1}\Ub_{\pi(0)}+(I_{N_0}-A_{00})^{-1}A_{01}((I_{N_1}-)^{-1}(\Ub_{\pi(1)}+\ib))
    \,.
    \]
    Applying the $\tau_0$ map we get  the solution of Prop.~\ref{prop:analytic_solution}
      \[
Y^{(\ib)}=\bar{\taub}_0^\top\Xb_{\pi(0)}^{(\ib)} = \bar{\taub}_0^\top (I_{N_0}-A_{00})^{-1}\Ub_{\pi(0)}+\bar{\taub}_0^\top  (I_{N_0}-A_{00})^{-1}A_{01}\Xb_{\pi(1)}^{(\ib)}
    =\bar{\taub}_0^\top(I_{N_0}-A_{00})^{-1}\Ub_{\pi(0)}+\bar{\taub}_1^\top \Xb_{\pi(1)}^{(\ib)}
    \]
    Moreover, applying the ${\tau}_1$ map to the variables in $\pi(1)$ we get
      \[
\bar{\taub}_1^\top \Xb_{\pi(1)}^{(\ib)} = \bar{\taub}_1^\top (I_{N_1}-A_{11})^{-1}(\Ub_{\pi(1)}+\ib_{\pi(1)})
= \bar{\taub}_1^\top (I_{N_1}-A_{11})^{-1}\Ub_{\pi(1)}
+\bar{\omegab}_1^\top \ib_{\pi(1)}\,.
    \]
    So by defining the vectors $\bar{\taub}_{U,1}^\top=\bar{\taub}_1^\top (I_{N_1}-A_{11})^{-1}$ and $\bar{\taub}_{U,0}^\top=\bar{\taub}_0^\top(I_{N_0}-A_{00})^{-1}$ we get a valid constructive abstraction linking the low-level map $\ub\mapsto \Lcal^{(\ib)}(\ub)$ to the following high-level map (first component is $Y$, second is the cause $Z_1$)
    \[
\rb \mapsto    \Hcal^{(\jb)}(\rb) = \begin{bmatrix}
        r_0+r_1+j_1\\
        r_1+j_1
    \end{bmatrix}
    \]
    such that for all $\ub$
    \[
\Hcal^{(\bar{\omegab}_1^\top\ib_{\pi(1)})}\left(
\begin{bmatrix}
  \bar{\taub}_{U,0}^\top \ub_{\pi(0)}\\
    \bar{\taub}_{U,1}^\top \ub_{\pi(1)}
\end{bmatrix}
\right)=\tau ( \Lcal^{(\ib)}(\ub))\,.
    \]
\end{proof}

\section{ALGORITHM DETAILS}\label{app:supalgo}

\subsection{Gaussian consistency loss}\label{app:gaussloss}

As the KL divergence is hard to estimate in the non-parametric setting, we make a Gaussian approximation of this loss to get an analytical, differentiable expression.
We use the general formula for the KL divergence between two $n$-dimensional Gaussian densities $P$ and $Q$:
\[
\mathrm{KL}(P||Q) = \frac{1}{2}
\left[ 
(\mu_Q-\mu_P)^\top \Sigma_Q^{-1}(\mu_Q-\mu_P)+\mbox{tr}(\Sigma_Q^{-1}\Sigma_P)-\log \frac{|\Sigma_P|}{|\Sigma_Q|}-n
\right] \,.
\]
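
For reference, it may help to spell out the univariate case $n=1$ of this formula. Writing $P=\Ncal(\mu_P,\sigma_P^2)$ and $Q=\Ncal(\mu_Q,\sigma_Q^2)$ (symbols used here only for this special case), the expression reduces to
\[
\mathrm{KL}(P||Q) = \frac{1}{2}
\left[
\frac{(\mu_Q-\mu_P)^2}{\sigma_Q^2}+\frac{\sigma_P^2}{\sigma_Q^2}-\log \frac{\sigma_P^2}{\sigma_Q^2}-1
\right] \,,
\]
which is the form taken by the conditional term for the scalar target $Y$ in the consistency loss below, with $P$ the pushforward conditional and $Q$ the high-level conditional.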

Parameters of the reduction are $\taub_k,\mub_Z,\mu_{Y|Z},f:z\to f(z),\omegab_k$
with
\begin{align*}
 {Z}^{(\omega(\ib))} &\sim {P}(z)=\Ncal(\mub_Z+W \ib,\Sigma_{\Zb})\,, \mbox{ with }W=[\omegab_1,...,\omegab_n]^\top\mbox{ and }\Sigma_{\Zb}=\mbox{diag}(\sigma_{Z,1}^2,...,\sigma_{Z,n}^2)\,,\\
 {Y}^{(\omega(\ib))}|\zb &\sim {P}(Y|z)=\Ncal(f(\zb),\sigma_{Y|\Zb}^2)\,,\\
 \hat{Z}^{(\ib)} &= [\taub_1,...,\taub_n]^\top \Xb^{(\ib)}= T\Xb^{(\ib)}\,,\\
 \hat{Y}^{(\ib)} &= \taub_0^\top \Xb^{(\ib)} \,.
\end{align*}
Moreover, we estimate the second order properties of the simulator distribution for each intervention $\ib$
\begin{align*}
\hat{\boldsymbol{\mu}}^{(\ib)}_{\Xb} &= \langle \Xb^{(\ib)} \rangle\,,\\
\hat{\Sigma}^{(\ib)}_{\Xb} &=\left\langle\left(\Xb^{(\ib)}-\hat{\boldsymbol{\mu}}^{(\ib)}_{\Xb}\right) \left(\Xb^{(\ib)}-\hat{\boldsymbol{\mu}}^{(\ib)}_{\Xb}\right)^\top\right\rangle\,,\\
    \hat{\mub}^{(\ib)}_{\Zb} &= \langle \hat{Z}^{(\ib)} \rangle = T \hat{{\mub}}^{(\ib)}_{\Xb}\,,\\
    \hat{\mu}^{(\ib)}_Y &= \langle \hat{Y}^{(\ib)} \rangle = \taub_0^\top \hat{{\mub}}^{(\ib)}_{\Xb}\,,\\
        {\widehat{\Sigma}}^{(\ib)}_{\Zb} &=\left\langle \left(\widehat{\Zb}^{(\ib)}-\hat{\mub}^{(\ib)}_Z\right)\left(\widehat{\Zb}^{(\ib)}-\hat{\mub}^{(\ib)}_Z\right)^\top \right\rangle = T \hat{\Sigma}^{(\ib)}_{\Xb} T^\top\,,\\
    {\widehat{\sigma^2}}^{(\ib)}_{Z,k} &= \left({\widehat{\Sigma}}^{(\ib)}_{\Zb}\right)_{k,k}=\left\langle \left(\widehat{Z}_k^{(\ib)}-\hat{\mu}^{(\ib)}_{Z,k}\right)^2 \right\rangle = \taub_k^\top \hat{\Sigma}^{(\ib)}_{\Xb} \taub_k\,,\\
    {\widehat{\sigma^2}}^{(\ib)}_Y &=\left\langle \left(\hat{Y}^{(\ib)}-\hat{\mu}^{(\ib)}_Y\right)^2 \right\rangle = \taub_0^\top \hat{\Sigma}^{(\ib)}_{\Xb} \taub_0\,,\\
    \widehat{\boldsymbol{c}}^{(\ib)}_{\Zb Y} &=\left\langle \left(\hat{Y}^{(\ib)}-\hat{\mu}^{(\ib)}_Y\right)\left(\hat{\Zb}^{(\ib)}-\hat{\mub}^{(\ib)}_Z\right) \right\rangle = T \hat{\Sigma}^{(\ib)}_{\Xb} \taub_0\,,
\end{align*}
where $\langle \cdot \rangle$ denotes the empirical average. Using the KL between Gaussian variables, we can rewrite the consistency loss as
\begin{multline}
    \Lcal_\mathrm{cons}=\EE_{\ib\sim p(\ib)}\left[ \mathrm{KL}_z\left(\hat{P}^{(\ib)}(z)||{P}^{(\omega(\ib))}(z)\right)
    \right]
    +\EE_{\ib\sim p(\ib),z\sim \hat{P}^{(\ib)}\left(\Zb\right)}\left[
        \mathrm{KL}_Y\left(
            \hat{P}^{(\ib)}\left({Y}|{\Zb}=z\right)||{P}^{(0)}\left({Y}|{\Zb}=z\right)
        \right)
    \right]\\
    =\frac{1}{2}\EE_{\ib\sim p(\ib)}\left[\sum_k\left(
            \frac{({\mu}_{Z,k}+\omegab_k^\top \ib -\hat{\mu}^{(\ib)}_{Z,k})^2}{\sigma_{Z,k}^2} 
            + \frac{{\widehat{\sigma^2}}^{(\ib)}_{Z,k}}{\sigma_{Z,k}^2} 
            \right)
            -\ln{\left(\frac{|{\widehat{\Sigma}}^{(\ib)}_{\Zb}|}{\prod_k \sigma_{Z,k}^2}\right)}
            -n
        \right]\\
        +\frac{1}{2}\EE_{\ib \sim p(\ib),z\sim \hat{P}^{(\ib)}\left(Z\right)}\left[
            \frac{\left(
                f(z) -\hat{\mu}^{(\ib)}_Y-\left(\widehat{\boldsymbol{c}}^{(\ib)}_{\Zb Y}\right)^\top
                \left(\widehat{\Sigma}^{(\ib)}_{\Zb}\right)^{-1}
                (\zb-\hat{\mub}^{(\ib)}_Z)
            \right)^2}{\sigma_{Y|Z}^2} \right.\\
            \left.
            + \frac{
                {\widehat{\sigma^2}}^{(\ib)}_Y
                -\left(\widehat{\boldsymbol{c}}^{(\ib)}_{\Zb Y}\right)^\top
                \left(\widehat{\Sigma}^{(\ib)}_{\Zb}\right)^{-1}\widehat{\boldsymbol{c}}^{(\ib)}_{\Zb Y}
                }{\sigma_{Y|Z}^2
            } 
            -\ln{\left(\frac{
                {\widehat{\sigma^2}}^{(\ib)}_Y
             -\left(\widehat{\boldsymbol{c}}^{(\ib)}_{\Zb Y}\right)^\top \left(\widehat{\Sigma}^{(\ib)}_{\Zb}\right)^{-1}\widehat{\boldsymbol{c}}^{(\ib)}_{\Zb Y}
                }{\sigma_{Y|Z}^2
            }\right) } -1
        \right] \,.
\end{multline}
\ifarxiv
The overall algorithm is described in Algorithm~\ref{alg:LCPR} of the main text.
\else
\subsection{Additional information}

The overall algorithm is described in Algorithm~\ref{alg:LCPR} of the main text, and its implementation can be found in the file 
\verb|targeted_causal_reduction/model/causal_pattern_reduction.py|. 
\fi


\section{EXPERIMENTAL DETAILS}
\label{app:experimental_details}

\subsection{Linear Experiments}
\label{app:linear_experiments}
\begin{table}[htb]
\centering
\begin{tabular}{l c c}
\toprule
\textbf{Parameters} & \textbf{Linear} (Fig.~\ref{fig:linear_loss}) & \textbf{Two Branch} (Fig.~\ref{fig:two_branch}) \\
\midrule
learning rate $\lambda$ & $10^{-3}$ & $10^{-3}$ \\
\hline
learning rate scheduler & - & cosine annealing \\
\hline
No.\ repeated train.\ runs per seed & $1$ & $10$ \\
\hline
simulation paths $n_\mathrm{sim}$ & $10,000$ & $10,000$ \\
\hline
training epochs $N_\mathrm{ite}$ & $100$ & $600$ \\
\hline
simulation batch size $B$ & $128$ & $128$ \\
\hline
intervention batch size $B_i$ & $64$ & $512$ \\
\hline
overlap reg.\ $\eta_\mathrm{ov}$ \eqref{eqn:overlap} & $0$ & $0.1$ \\
\hline
balancing reg.\ $\eta_\mathrm{bal}$ \eqref{eqn:balancing} & $0$ & $10^{-3}$ \\
\bottomrule
\end{tabular}
\caption{\small \textbf{Experimental parameters and settings for the linear Gaussian experiments.}}
\label{tab:data_training_linear}
\end{table}

\paragraph{Sampling linear Gaussian low-level models}
For the adjacency matrix, we sample all non-zero entries uniformly in the interval $[-1, 1]$.
For general adjacency matrices, the lower triangular elements of the adjacency matrix are non-zero, where we assume that the target $Y$ has only incoming edges and the variables are arranged in topological order.
For the two-branch graph, values in the adjacency are set to zero accordingly.
For chain graphs, the first lower off-diagonal entries are non-zero.
The exogenous variables $\Ub$ and shift interventions $\ib$ are independent Gaussian with $U_j, i_j \sim \Ncal(0, 1)$ for $j=1, ..., N$.

\paragraph{Data and Training}
The data and training parameters are summarized in Table~\ref{tab:data_training_linear}.
All simulation data is generated before training and reused in each epoch.
We split the data into training ($70\%$), validation and test ($15\%$ each).
Since the training of the two-variable model would occasionally get stuck in local minima, we run each training with 10 different random initializations of the weights and select the model with the best total validation loss~\eqref{eqn:total_loss} at the end of training.
Furthermore, we use a cosine annealing learning rate scheduler with a final learning rate of $10^{-5}$.


\subsection{Double Well}
\label{app:double_well}


\paragraph{Simulation}
We model the ball moving in a double well potential $V(x) = x^4 - 4x^2$, shown in Figure~\ref{fig:double_well}(a), by the following equation of motion:
\begin{equation}
    m \ddot{x}(t) + k \dot{x}(t) + \frac{\partial}{\partial x} V(x(t)) = 0 \quad  \Rightarrow \quad m \ddot{x}(t) + k \dot{x}(t) + 4x(t)^3 -8x(t) = 0 \, ,
    \label{eqn:eom}
\end{equation}
where $x(t)$ is the position of the ball at time $t$, $\dot{x}(t)$ and $\ddot{x}(t)$ are the first and second time derivatives, respectively, $k$ is the friction coefficient and $m$ is the mass of the ball.
We can reformulate the second order ODE into a system of first order ODEs by introducing the velocity $v(t) = \dot{x}(t)$ as a variable:
\begin{align}
    \dot{x}(t) =&\, v(t) \nonumber\\
    \dot{v}(t) =&\, -\frac{1}{m} \left( k v(t) + 4x(t)^3 - 8x(t) \right) \,.
\end{align}
We solve the system of ODEs numerically on a grid of 101 time points $t_k$ for $k=0, \dots, 100$ equally spaced between $t=0$ and $t=10$ using a numerical integration method.
The initial conditions are $x(0) = -2.07414285 + 5\times10^{-7} \times \varepsilon_x$, with $\varepsilon_x\sim \mathrm{Uniform}(-1, 1)$ and $v(0) = 11$. The initial values are chosen such that there is a non-zero chance that the ball ends up in the left or right well without any additional interventions.
\par 
For shift interventions, we sample random velocity shifts $\Delta v(t_k)\sim \Ncal(0, 0.5)$.
The positions are unshifted.
In the numerical integration scheme, the shift interventions are implemented by splitting the integration domain in parts.
The ODE system is integrated from the initial conditions at $t_0$ to the next grid point $t_1$.
Then the velocity at $t_1$ is shifted by $\Delta v(t_1)$ and used as the initial value for the next integration starting at $t_1$, and so on.
Similarly, we introduce independent stochasticity by adding noise to the velocity sampled from $\Ncal(0, 0.2)$ at each time step, mimicking intrinsic noise of the system.
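
The splitting scheme described above can be summarized in one update rule. The notation below is introduced here for illustration only: $v(t_k^-)$ and $v(t_k^+)$ denote the velocity just before and just after the update at grid point $t_k$, $\Phi_{k}$ denotes the numerical flow of the ODE system over the interval $[t_{k-1},t_k]$, and $\xi_k$ denotes the intrinsic noise, which we assume is added at the same grid points as the shifts. Each step then reads
\[
\bigl(x(t_k),\,v(t_k^-)\bigr)=\Phi_{k}\bigl(x(t_{k-1}),\,v(t_{k-1}^+)\bigr)\,,\qquad
v(t_k^+)=v(t_k^-)+\Delta v(t_k)+\xi_k\,,\quad \xi_k\sim \Ncal(0, 0.2)\,,
\]
with the position $x(t_k)$ left unchanged by the update.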

\paragraph{Data and Training}
\begin{table}[htb]
\centering
\begin{tabular}{l c}
\toprule
\textbf{Parameters} & \textbf{Double Well} (Fig.~\ref{fig:double_well})\\
\midrule
learning rate $\lambda$ & $5 \cdot 10^{-4}$\\
\hline
learning rate scheduler & - \\
\hline
No.\ repeated train.\ runs per seed & $1$ \\
\hline
simulation paths $n_\mathrm{sim}$ & $10,000$ \\
\hline
training epochs $N_\mathrm{ite}$ & $200$ \\
\hline
simulation batch size $B$ & $128$ \\
\hline
intervention batch size $B_i$ & $64$ \\
\hline
overlap reg.\ $\eta_\mathrm{ov}$ \eqref{eqn:overlap} & $0$ \\
\hline
balancing reg.\ $\eta_\mathrm{bal}$ \eqref{eqn:balancing} & $0$ \\
\bottomrule
\end{tabular}
\caption{\small \textbf{Experimental parameters and settings for the double well experiments.}}
\label{tab:data_training_double_well}
\end{table}
The data and training parameters are summarized in Table~\ref{tab:data_training_double_well}.
All simulation data is generated before training and reused in each epoch.
We split the data into training ($70\%$), validation and test ($15\%$ each).

\subsection{Spring-Mass System}
\label{app:spring_mass_system}

\begin{table}[htb]
\centering
\begin{tabular}{l c c}
\toprule
\textbf{Parameters} & \textbf{4 masses with different weights} (Fig.~\ref{fig:spring_mass}) & \textbf{2 groups of masses} (Fig.~\ref{fig:spring_mass_grouped}) \\
\midrule
learning rate $\lambda$ & $10^{-4}$ & $10^{-3}$ \\
\hline
learning rate scheduler & cosine annealing & cosine annealing \\
\hline
No.\ repeated train.\ runs per seed & $5$ & $5$ \\
\hline
simulation paths $n_\mathrm{sim}$ & $10,000$ & $10,000$ \\
\hline
training epochs $N_\mathrm{ite}$ & $4,800$ & $1,800$ \\
\hline
simulation batch size $B$ & $128$ & $128$ \\
\hline
intervention batch size $B_i$ & $64$ & $512$ \\
\hline
overlap reg.\ $\eta_\mathrm{ov}$ \eqref{eqn:overlap} & $0.1$ & $0.1$ \\
\hline
balancing reg.\ $\eta_\mathrm{bal}$ \eqref{eqn:balancing} & $0.1$ & $0.1$ \\
\hline
spring constant $k$ & $10^{-3}$ & $10^{-3}$ \\
\hline
rest length $u_0$ & 1 & 1 \\
\hline
masses $m_i$ & $(0.5, 0.83, 0.17, 1.5)$ & all $1$ \\
\bottomrule
\end{tabular}
\caption{\small \textbf{Experimental parameters and settings for the spring mass system experiments.}}
\label{tab:data_training_spring_mass}
\end{table}

\paragraph{Simulation}

Let $M$ be the number of masses.
Then, $m_i\in \RR$, $\tilde{\vec{x}}_i(t)\in \RR^2$ and $\vec{v}_i(t)\in \RR^2$ represent the weight, position and velocity of mass $i=1, \ldots, M$ at time $t$.
$A\in \{0, 1\}^{M\times M}$ is the adjacency matrix encoding the spring connections, where $A_{ij}=1$ indicates that a spring connects masses $i$ and $j$.
The rest length at which the springs exert no force is denoted by $u_0$ and $k$ is the spring constant.
Both $u_0$ and $k$ are assumed to be the same for all springs.

The total force acting on mass $i$ at time $t$ is given by
\begin{equation}
    \vec{F}_i(t) = -k \sum_{j, A_{ij}=1} \left( \| \vec{u}_{ij}(t) \| - u_0 \right) \frac{\vec{u}_{ij}(t)}{\| \vec{u}_{ij}(t) \|}
\end{equation}
where $\vec{u}_{ij}(t) = \tilde{\vec{x}}_i(t) - \tilde{\vec{x}}_j(t)$ is the displacement vector from mass $j$ to mass $i$.
The equations of motion are
\begin{equation}
    \frac{\mathrm{d} \tilde{\vec{x}}_i(t)}{\mathrm{d} t} = \vec{v}_i(t), 
    \qquad
    \frac{\mathrm{d} \vec{v}_i(t)}{\mathrm{d} t} = \vec{a}_i(t),
    \quad
    \text{with}
    \quad
    \vec{a}_i(t) = \frac{\vec{F}_i(t)}{m_i} \, .
\end{equation}
We assume that the masses have no volume, do not collide, and interact only through the forces exerted by the springs.
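\par
As a consistency check on the force law, consider two masses joined by a single spring. The following sketch assumes that the adjacency matrix is symmetric ($A_{12}=A_{21}=1$), which is natural for undirected springs but is an assumption made here for illustration. Since $\vec{u}_{12}(t) = -\vec{u}_{21}(t)$, the two forces are equal and opposite:
\begin{equation}
    \vec{F}_1(t) = -k \left( \| \vec{u}_{12}(t) \| - u_0 \right) \frac{\vec{u}_{12}(t)}{\| \vec{u}_{12}(t) \|} = -\vec{F}_2(t) \, .
\end{equation}
A stretched spring ($\| \vec{u}_{12}(t) \| > u_0$) therefore pulls the two masses towards each other, while a compressed spring pushes them apart. Under the same symmetry assumption, the pairwise contributions cancel for any number of masses, so that $\sum_{i} \vec{F}_i(t) = 0$ and the total momentum $\sum_i m_i \vec{v}_i(t)$ is unchanged by the spring forces alone.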

We solve the system of ODEs numerically on a grid of 21 time points $t_k$ for $k=0, \dots, 20$ equally spaced between $t=0$ and $t=100$ using a numerical integration method.
The positions are initially set on a grid to $\tilde{\vec{x}}_1(t=0) = (0,0) + \tilde{\vec{x}}_\mathrm{offset}$, $\tilde{\vec{x}}_2(t=0) = (1,0) + \tilde{\vec{x}}_\mathrm{offset}$, $\tilde{\vec{x}}_3(t=0) = (0,1) + \tilde{\vec{x}}_\mathrm{offset}$ and $\tilde{\vec{x}}_4(t=0) = (1,1) + \tilde{\vec{x}}_\mathrm{offset}$, where $\tilde{\vec{x}}_\mathrm{offset} \sim \Ncal(0, 10)$ is a random offset that shifts the entire system.
The initial velocities are independently drawn as $\vec{v}_i(t=0)\sim \Ncal(0, 0.01)$.
We apply random independent velocity shifts $\Delta \vec{v}_i(t_k)\sim \Ncal(0, 0.005)$ at each time step and incorporate them into the ODE solver in the same way as for the double well experiment in App.~\ref{app:double_well}.

The feature vector $\Xb$ used to learn the TCR of the spring-mass system consists of the velocity values of all masses at all simulated time points.
The interventions $\ib$ are the corresponding velocity interventions.
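\par
For concreteness, the size of the feature vector can be worked out from the quantities above. The count below is a sketch that assumes all $21$ grid points (including $t=0$) and both velocity components enter $\Xb$:
\begin{equation}
    \dim \Xb = M \times 2 \times 21 = 42\, M \, ,
\end{equation}
which gives $168$ entries for the system with $M=4$ masses. If the same time grid is used for the grouped system with $M=8$ masses, the count is $336$. The grid spacing is $100/20 = 5$ time units.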

\paragraph{Data and Training}
The data and training parameters are summarized in Table~\ref{tab:data_training_spring_mass}.
All simulation data is generated before training and reused in each epoch.
We split the data into training ($70\%$), validation and test ($15\%$ each).
As in the experiments on the two-branch linear graph in App.~\ref{app:linear_experiments}, we repeat the training runs with different weight initializations and use a cosine annealing learning rate scheduler.



\subsection{Additional results}\label{app:additres}
\subsubsection{Spring-Mass System without Regularization}
\label{app:spring_mass_noreg}
\begin{figure*}[htb]
\centering
    \includegraphics[width=0.5\linewidth]{figures/mass_spring_noreg.pdf}
	\caption{\small
	\textbf{Spring-mass system experiment without regularization.}
        Same experimental setup as described in Sec.~\ref{ssec:spring_mass_system} and App.~\ref{app:spring_mass_system} with the regularization turned off, \textit{i.e.}\ $\eta_\mathrm{ov}=\eta_\mathrm{bal} = 0$.
        The learned high-level mechanism is $f(\Zb) \approx -0.180 Z_1 + 0.125 Z_2 $.
        }
    \label{fig:spring_mass_noreg}
\end{figure*}

When the TCR algorithm is run without regularization, there is no guarantee that the learned high-level variables correspond to different properties of the low-level system, as shown in Fig.~\ref{fig:spring_mass_noreg}.
The high-level variables mix significantly: in particular, the velocity of the masses in the $x$-direction towards the end of the simulation contributes to both of them.

\subsubsection{Grouped Spring-Mass System}
\label{app:spring_mass_grouped}
\begin{figure*}[htb]
\centering
    \includegraphics[width=\linewidth]{figures/mass_spring_grouped.pdf}
	\caption{\small
	\textbf{Spring-mass system experiment with two groups of masses.}
        (a) Simulated system of eight point masses with equal weights connected by springs in two groups of 4 and with random initial velocity (blue arrows).
        In contrast to the experiment shown in Sec.~\ref{ssec:spring_mass_system}, the target is the center-of-mass velocity in the $x$-direction.
        (b--e) Learned $\taub$- and $\omegab$-weights corresponding to velocity components in $x$- and $y$-direction for a TCR with two high-level variables.
        The learned high-level mechanism is $f(\Zb) \approx -0.0866 Z_1 -0.0782 Z_2 $.
        }
    \label{fig:spring_mass_grouped}
\end{figure*}
We simulate two groups of four masses as shown in Fig.~\ref{fig:spring_mass_grouped}(a).
In contrast to the experiment shown in Sec.~\ref{ssec:spring_mass_system}, all masses have equal weight and the target is the center-of-mass velocity in the $x$-direction at the end of the simulation.
The data and training parameters are summarized in Table~\ref{tab:data_training_spring_mass}.
\par
Since the only interactions between masses are mediated by the springs, as described in App.~\ref{app:spring_mass_system}, the two groups of masses do not influence each other and are thus fully independent.
The learned TCR identifies the two groups of masses as the two independent causes of the target.
This is reflected in the parameters shown in Fig.~\ref{fig:spring_mass_grouped}(b--e), where high-level variable $Z_1$ is predominantly influenced by the behavior of the second group (yellow) and variable $Z_2$ by the first group (blue).
Furthermore, we observe that the $y$-component of the velocity, which is irrelevant for the target here, is ignored by the TCR and filtered out.
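\par
This split into two causes can be read off from the definition of the target. The following sketch assumes the standard mass-weighted definition of the center-of-mass velocity. The index sets $G_1$ and $G_2$ for the two groups and the symbol $v^{\mathrm{cm}}_{x}$ are notation introduced here for illustration only, with $v_{i,x}(t)$ denoting the $x$-component of $\vec{v}_i(t)$:
\begin{equation}
    v^{\mathrm{cm}}_{x}(t)
    = \frac{\sum_{i=1}^{8} m_i \, v_{i,x}(t)}{\sum_{i=1}^{8} m_i}
    = \frac{1}{2} \left( \frac{1}{4} \sum_{i \in G_1} v_{i,x}(t) + \frac{1}{4} \sum_{i \in G_2} v_{i,x}(t) \right) ,
\end{equation}
where the second equality uses $m_i = 1$ for all masses. The target is thus the average of the two group-wise center-of-mass velocities, each of which depends only on the masses of its own group, and no $y$-component appears in the expression.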
