\newif\ifarxiv
\arxivfalse
\PassOptionsToPackage{breaklinks,colorlinks=true, linkcolor=BrickRed, urlcolor=Blue, citecolor=Blue, anchorcolor=blue, backref=page}{hyperref}
\documentclass[accepted]{uai2024}

\ifarxiv
\usepackage[accepted]{arxiv}
\else
% \usepackage{aistats2024}
\fi

\usepackage{enumitem}
\setitemize{noitemsep,topsep=0pt,parsep=0pt,partopsep=0pt}




\usepackage[utf8]{inputenc} % allow utf-8 input
\usepackage[T1]{fontenc}    % use 8-bit T1 fonts
\usepackage{url}            % simple URL typesetting
\usepackage{booktabs}       % professional-quality tables
\usepackage{amsfonts}       % blackboard math symbols
\usepackage{nicefrac}       % compact symbols for 1/2, etc.
\usepackage{microtype}      % microtypography

\usepackage[noend]{algcompatible}
\usepackage{algorithm}

\usepackage[dvipsnames]{xcolor}    
% \usepackage{cleveref}
\input{neurips_header}
\input{definitions}

\usepackage{placeins}
\usepackage{tocloft}
\setlength{\cftbeforesecskip}{6pt}
\usepackage{listings}



\usepackage{xr}
\makeatletter
\newcommand*{\addFileDependency}[1]{% argument=file name and extension
\typeout{(#1)}% latexmk will find this if $recorder=0
% however, in that case, it will ignore #1 if it is a .aux or 
% .pdf file etc and it exists! If it doesn't exist, it will appear 
% in the list of dependents regardless)
%
% Write the following if you want it to appear in \listfiles 
% --- although not really necessary and latexmk doesn't use this
%
\@addtofilelist{#1}
%
% latexmk will find this message if #1 doesn't exist (yet)
\IfFileExists{#1}{}{\typeout{No file #1.}}
}\makeatother

\newcommand*{\myexternaldocument}[1]{%
\externaldocument{#1}%
\addFileDependency{#1.tex}%
\addFileDependency{#1.aux}%
}

\myexternaldocument{supplement}


\usepackage[round]{natbib}
\bibliographystyle{plainnat}
\renewcommand{\bibsection}{\subsubsection*{References}}

\usepackage[colorinlistoftodos,prependcaption,textsize=tiny]{todonotes}

\makeatletter
 \if@todonotes@disabled
 \newcommand{\hlnote}[2]{#1}
 \else
 \newcommand{\hlnote}[2]{\todo{#2}\texthl{#1}}
 \fi
 \makeatother
 \setlength{\marginparwidth}{2cm}

\title{Targeted Reduction of Causal Models}

% The standard author block has changed for UAI 2024 to provide
% more space for long author lists and allow for complex affiliations
%
% All author information is automatically removed by the class for the
% anonymous submission version of your paper, so you can already add your
% information below.
%
% Add authors
\author[1]{Armin~Keki\'c}
\author[1]{Bernhard~Sch\"olkopf}
\author[1]{Michel~Besserve}
% Add affiliations after the authors
\affil[1]{%
    Max Planck Institute for Intelligent Systems\\
    T\"ubingen, Germany
}

\begin{document}
\maketitle


\begin{abstract}\looseness-1
Why does a phenomenon occur?
Addressing this question is central to most scientific inquiries and often relies on simulations of scientific models. 
As models become more intricate, deciphering the causes behind phenomena in high-dimensional spaces of interconnected variables becomes increasingly challenging.
Causal Representation Learning (CRL) offers a promising avenue to uncover interpretable causal patterns within these simulations through an interventional lens.
However, developing general CRL frameworks suitable for practical applications remains an open challenge.
We introduce \textit{Targeted Causal Reduction} (TCR), a method for condensing complex intervenable models into a concise set of causal factors that explain a specific target phenomenon.
We propose an information theoretic objective to learn TCR from interventional data of simulations, establish identifiability for continuous variables under shift interventions and present a practical algorithm for learning TCRs.
Its ability to generate interpretable high-level explanations from complex models is demonstrated on toy and mechanical systems, illustrating its potential to assist scientists in the study of complex phenomena in a broad range of disciplines.\footnotemark\label{fn:myfootnote}
\end{abstract}


\section{INTRODUCTION}
\label{sec:introdcution}
Numerical models are indispensable in science for simulating real-world systems and generating \textit{etiological explanations}, that is, identifying the causes of specific phenomena.
\footnotetext{\hyperref[fn:myfootnote]{\hphantom{}}Code is available at: \href{https://github.com/akekic/targeted-causal-reduction.git}{https://github.com/akekic/targeted-causal-reduction.git}.}%
General circulation models, for example, shed light on the causes of global warming \citep{grassl2000status}, while computational brain models explore the origins of neurological pathologies \citep{breakspear2017dynamic,deco2014great}.
These examples illustrate the increasing complexity of numerical scientific models, designed to faithfully capture the large number of mechanisms at play in these systems.
However, this complexity comes at a cost: expanding parameter spaces and heightened computational demands. 
This trend, in turn, impacts the ability to generate high-level explanations, understandable by scientists and decision makers \citep{reichstein2019deep,safavi2023uncovering}. 
\par \looseness -1
Effective human explanations are often based on understanding a few causal relations between a limited number of variables.
While the simulation of complex systems might rely on numerous simple mechanisms, extracting overarching causal relations between fewer relevant high-level variables remains largely unaddressed. 
In particular, while causal representation learning tries to explain data based on a learned latent causal graph \citep{liang2023causal,squires2023linear,von2023nonparametric}, it currently has theoretical and practical limitations.
CRL largely relies on preserving all information in the data to provide recoverability guarantees for the latent causes, whereas the purpose of a high-level representation is precisely to discard irrelevant information. 
\par
In contrast, Causal Model Reduction (CMR), which aims to map a low-level causal model to a simpler high-level model with fewer or lower-dimensional variables, embraces the purpose of eliminating irrelevant information.
However, existing CMR approaches, such as causal abstractions \citep{geiger2023causal,zennaro2023jointly} and Causal Feature Learning (CFL) \citep{chalupka2016unsupervised}, are not well-suited to causally describe many scientific models: they use discrete variables and typically rely on \textit{hard} interventions, disconnecting causal variables from their parents. 
The following example shows, however, that simpler high-level causal models for continuous variables and \textit{soft} interventions are natural and useful in domains such as physics. 

\begin{figure*}
\centering
     \begin{subfigure}[t]{0.56\textwidth}
         \centering
         \includegraphics[width=\linewidth]{figures/mechanalogv3.pdf}
	\caption{\small\label{fig:mechaexpl}}
     \end{subfigure}
     \hfill
     \begin{subfigure}[t]{0.37\textwidth}
         \centering
         \includegraphics[width=\linewidth]{figures/constructiveTransformationv3.pdf}
    \caption{\small
    \label{fig:constrans}}
     \end{subfigure}
     \caption{\small 
     \textbf{Targeted Causal Reduction.}
     (a) Example targeted model reduction: a model of the dynamics of a system of point masses connected by springs can be reduced to the trajectory of its center of mass.
     (b) Overview of TCR. 
     Low-level variables $\Xb$ (simulation) are mapped to high-level variables $(\Zb, Y)$ with a fixed causal structure.
     The target $Y$ is known, while the causes $\Zb$ and the high-level causal mechanism are learned.
     Additionally, we learn a mapping from low-level shift interventions $\ib$ to high-level shift interventions $\omegab(\ib)$.
     }
\end{figure*}
\looseness-1
Consider a system of point masses connected by springs shown in Fig.~\ref{fig:mechaexpl}, where each mass is influenced by random external forces. 
Its trajectory under the intervention of external forces can be accurately predicted by simulating the coupled equations of motion of individual point masses. 
However, if we are only interested in a particular macroscopic ``target'' variable of this system, namely the horizontal speed of the system's center of mass at the end of an experiment, a key result from classical physics is that it depends only on the sum of all horizontal components of the external forces applied over time.
We thus obtain a form of CMR: a much simpler system that accurately accounts for the effect of interventions in the system on the target variable.
\par
This highlights the core elements needed for CMR in scientific models:
(1) there is a clearly defined macroscopic target variable,
(2) continuous low-level variables are reduced to a smaller set of continuous high-level variables, and 
(3) low-level interventions are soft: exerted forces modify the future trajectory but do not suppress the influence of other factors such as the past state of the system.
Moreover, it is the combination of many low-level soft interventions across time that corresponds to a relevant high-level intervention. 
These three aspects are commonly found characteristics of studied real-world systems, motivating the development of CMR algorithms adapted to this setting. 
\par \looseness-1
In this paper, we introduce \emph{Targeted Causal Reduction} (TCR), depicted in Fig.~\ref{fig:constrans}, a novel approach designed to simplify complex \emph{low-level models} into \emph{high-level models}, focused on explaining causal influences on an observable target variable $Y$.
The key signal we use for learning are \emph{interventions} applied to the low-level variables, which are mapped to high-level interventions in a way that captures the causal influences on $Y$ in a concise and interpretable way.
We formulate this learning objective as a Kullback-Leibler divergence between the fitted high-level interventional model and the reduced low-level interventional distribution, leading to a practical learning algorithm for the case of linear reductions.
Applications to high-dimensional synthetic and scientific models demonstrate the accuracy and interpretability of our approach. 
We refer to the appendix for more related work (\ref{app:relwork}) and proofs (\ref{app:proofs}).

\section{BACKGROUND}
\label{sec:bckgd}
\subsection{Structural causal models}
Causal dependencies between variables can be described using \textit{Structural Causal Models} (SCM)~\citep{causality_book}.
\paragraph{Notation.} We use boldface for column vectors, and $\ib_S$ for the subvector of $\ib$ restricted to the components in set $S$.
\begin{definition}[SCM]\label{def:SCM}
    An $n$-dimensional structural causal model is a triplet $\Mcal=(\mathcal{G},\mathbb{S},P_\Ub)$ consisting of:
    \begin{itemize}
        \item a joint distribution $P_\Ub$ over exogenous variables $\{U_j\}_{j\leq n}$,
        \item a directed graph $\mathcal{G}$ with $n$ vertices,
        \item a set $\mathbb{S}=\{X_j \coloneqq f_j(\textbf{Pa}_j,U_j), j=1,\dots,n\}$ of structural equations, 
        where $\textbf{Pa}_j$ are the variables indexed by the set of parents of vertex $j$ in $\mathcal{G}$,
    \end{itemize} 
    such that for almost every $\ub$, the system $\{x_j \coloneqq f_j(\textbf{pa}_j,u_j)\}$ has a unique solution $\xb=\gb(\ub)$, with $\gb$ measurable. 
\end{definition}
\looseness -1
The unique solvability condition is included in this definition because we consider a general class of SCMs by allowing cycles, that is, $\Gcal$ may not be a DAG.
Moreover, we allow hidden confounding through the potential lack of independence between the exogenous variables $\{U_j\}$.
See \citet{bongers2021foundations} for a thorough study of these models. 
Under these conditions, the distribution $P_{\Ub}$ entails a well-defined joint distribution over the endogenous variables $P(\Xb)$. 
\par
Interventions in SCMs involve replacing one or more structural equations, potentially modifying exogenous distributions, and adding or removing arrows in the original graph to reflect changes in dependencies between variables.
An intervention transforms the original model $\Mcal =(\mathcal{G},\mathbb{S},P_\Ub)$ into an intervened model $\Mcal^{(\ib)} =(\mathcal{G}^{(\ib)},\mathbb{S}^{(\ib)},P_\Ub^{(\ib)})$, where $\ib$ is the vector parameterizing the intervention.
The base probability distribution of the unintervened model is denoted $P_{\Mcal}^{(0)}(\Xb)$ or simply $P_{\Mcal}(\Xb)$ and the interventional distribution associated with $\Mcal^{(\ib)}$ is denoted $P_{\Mcal}^{(\ib)}(\Xb)$.
\par 
\looseness -1
Classical $do$-interventions set a structural equation to a constant, removing all influences of the parents on the intervened variable. 
This can be problematic for studying how the influence of low-level variables is propagated to the target, since for simultaneous interventions, the effects of some interventions can be masked by others.
The probability of such masking increases as the number of low-level variables grows.
\textit{Soft interventions}, on the other hand, modify an equation while keeping the set of parents unchanged.
This is more appropriate in our setting, since it propagates the information from all interventions to the target simultaneously.
\par \looseness -1
Large classes of soft interventions can be designed to match domain knowledge \citep{besserve2022learning}. 
Notably, \textit{shift interventions} modify the structural equation of endogenous variable $l$ by shifting it by a scalar parameter $i$:
\begin{equation}
\{X_l\coloneqq f_l(\textbf{Pa}_l,U_l)\} \mapsto \{X_l\coloneqq f_l(\textbf{Pa}_l,U_l)+i\}\,.
\end{equation}
These can be combined to form multi-node interventions with vector parameter $\ib$.
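\par
As an illustrative sketch (a hypothetical two-variable example, not taken from the experiments), consider the linear SCM $X_1 \coloneqq U_1$, $X_2 \coloneqq a X_1 + U_2$ with $a \in \RR$.
A multi-node shift intervention with parameter $\ib = (i_1, i_2)^\top$ yields
\[
X_1 = U_1 + i_1\,, \qquad X_2 = a\,(U_1 + i_1) + U_2 + i_2\,,
\]
so both components of $\ib$ remain visible in $X_2$.
Under the simultaneous hard interventions $do(X_1 = c_1)$ and $do(X_2 = c_2)$, in contrast, $X_2 = c_2$ regardless of $c_1$: the effect of the first intervention is masked by the second.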

\subsection{Simulations and Causal Models}
We use the term \textit{scientific model} to refer to a generative model that relies on a set of equations to represent a phenomenon.
What distinguishes such models from generative models in machine learning is their decomposability into elementary functions, encoding domain knowledge about the mechanisms being investigated. 
Simulations based on the numerical solution of scientific models can often be expressed as SCMs; 
this notably includes Ordinary Differential Equations (ODE)~\citep{mooij2013ordinary} and Stochastic Differential Equations (SDE)~\citep{hansen2014causal}. 
A simulator can thus be seen as a low-level causal model, from which samples of unintervened and intervened distributions can be generated.
This forms the basis of the causal framework for learning high-level explanations for simulators developed in Sec.~\ref{ssec:framew}.

\subsection{Causal Model Reductions (CMR)}

We consider as CMR any (possibly approximate) mapping from a low-level SCM $\Lcal$ to a simpler high-level SCM $\Hcal$.
An example is CFL \citep{chalupka2014visual,chalupka2016unsupervised}, which achieves a CMR by merging values of a large observation space to yield discrete high-level variables taking values in a small finite set.
Consider:
\begin{itemize}
    \item $\Lcal$ has a vector of endogenous variables $\Xb$ with range $\Xcal$ and a set of interventions $\Ical$,
    \item $\Hcal$ has a vector of endogenous variables $\Zb$ with range $\Zcal$ and a set of interventions $\Jcal$.
\end{itemize}
Starting from the distribution of the low-level model $P_{\Lcal}(\Xb)$, a deterministic mapping $\tau:\Xcal\to\Zcal$ generates a joint distribution on the high-level variables that is the push-forward distribution of $P_{\Lcal}(\Xb)$ by $\tau$, denoted $\tau_\# [P_{\Lcal}(\Xb)]$ such that
\[
\tau(\Xb) \sim \tau_\# [P_{\Lcal}(\Xb)]\,.
\]
The low-level interventional distributions can be pushed forward to the high-level in the same way.
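For instance, as a sketch under the assumption of a Gaussian low-level distribution and a linear map (the symbols $\boldsymbol{\mu}$, $\Sigma$, $T$ and $\boldsymbol{\delta}$ are introduced here only for illustration), if $P_{\Lcal}(\Xb) = \Ncal(\boldsymbol{\mu}, \Sigma)$ and $\tau(\xb) = T\xb$ for a matrix $T$, then
\[
\tau_\# [P_{\Lcal}(\Xb)] = \Ncal\left(T\boldsymbol{\mu},\, T \Sigma T^\top\right)\,.
\]
A change of the low-level mean by a vector $\boldsymbol{\delta}$ thus changes the high-level mean by $T\boldsymbol{\delta}$ while leaving the high-level covariance unchanged.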
\par
A general framework for CMR is based on the notion of \textit{exact transformation}, which ensures \textit{interventional consistency} by matching the push-forward low-level distributions to the high-level ones.
\begin{definition}[Exact transformation \citep{rubenstein2017causal}]
A map $\tau:\Xcal\to \Zcal$ is an exact transformation from $\Lcal$ to  $\Hcal$ if it is surjective, and there exists a surjective \textit{intervention map} $\omega:\Ical\to\Jcal$ such that for all  $\ib \in\Ical$
\[
\tau_{\#}[P_{\Lcal}^{(\ib)}(\Xb)] = P_{\Hcal}^{(\omega(\ib))}(\Zb)\,.
\]
\end{definition}
The set of possible $\tau$ can be restricted to \textit{constructive transformations}, where high-level variables depend only on non-overlapping subsets of low-level variables.
This eases interpretability of CMR and comes with characterization results  \citep{beckers2019abstracting,geiger2023causal}.
\begin{definition}
$\tau:\Xcal\to\Zcal$ is a constructive transformation between model $\Lcal$ and $\Hcal$ if there exists an alignment map $\pi$ relating indices of each high-level endogenous variable to a subset of indices of low-level endogenous variables such that for all $k\neq l$, $\pi(k)\cap \pi(l)=\emptyset$ and for each component $\tau_k$ of $\tau$ there exists a function $\bar{\tau}_k$ such that for all $\xb$ in $\Xcal$,
\[
\tau_k (\xb) = \bar{\tau}_k(\xb_{\pi(k)})\,.
\]
\end{definition}
The intervention map $\omega$ of a constructive exact transformation is required to be constructive as well, such that the intervention acting on high-level variable $k$ depends only on low-level interventions acting on variables in $\pi(k)$ (see App.~\ref{app:supbkgd}).

\section{THEORETICAL ANALYSIS}
\label{sec:theoretical_analysis}

As described in Fig.~\ref{fig:constrans}, we consider endogenous variables of a low-level model gathered in a (high-dimensional) random vector $\Xb$.
A target scalar variable $Y=\tau_0(\Xb)$ quantifies a property of interest of this model, and can be thought of as quantifying the presence or magnitude of a \textit{phenomenon} in the data,  using \textit{detector} $\tau_0$.
To generate a high-level causal explanation of this phenomenon, we learn a high-level SCM with a fixed causal structure, where the known effect variable $Y$ is caused by $n$ learned independent high-level variables $Z_k$.
The low-level variables $\Xb$ are approximately mapped to the high-level variables ${\Zb}$ using a constructive transformation and an associated constructive interventional map with the same alignment $\pi$.

\subsection{TCR framework}\label{ssec:framew}

Our reduction framework has the following elements:
\par
\looseness-1 (1) A low-level SCM $\Lcal$
    with $N$ endogenous variables $\{X_1, \dots,X_N\}$
     and corresponding exogenous variables $\{U_k\}_{k= 1..N}$ equipped with joint distribution $P(\Ub)$. 
    A set of low-level shift interventions parameterized by vector $\ib{\in} \Ical$ with distribution $P(\ib)$, with each component $i_k$ acting on one endogenous variable $X_k$. 
    % Unintervened variables are assigned a fixed intervention parameter $i_k=0$. %and a set of model parameters $\thetab$. 
    We only assume we can sample from unintervened and interventional distributions of $\Lcal$.
\par
(2) A class of high-level SCMs $\{\Hcal_{\gammab}\}_{\gammab \in \Gamma}$ with $(n+1)$ endogenous variables  $\{Y, {Z}_1,\dots,{Z}_n\}$ and associated exogenous variables $\{R_k\}_{k= 0..n}$, equipped with a factorized distribution $P(\Rb)=\prod P_{R_k}$.
    A set of high-level shift interventions parametrized by vector $\jb{\in} \Jcal$,  with each component $j_k$ affecting a single node $Z_k$. In contrast to the (fixed) low-level model, the high-level model parameters $\gammab$ are learned.
\par
These two levels are linked by a constructive transformation with two deterministic surjective maps $\tau$ and $\omega$ from low- to high-level endogenous variables and interventions, respectively, which decompose as
\begin{align}
    \tau & = (\tau_0,\tau_1,\tau_2,\dots,\tau_n)\mbox{ with }\tau_k:\xb\mapsto \bar{\tau}_k(\xb_{\pi(k)})\\
    \omega & = (\omega_0,\omega_1,\omega_2,\dots,\omega_n)\mbox{ with }\omega_k:\ib\mapsto \bar{\omega}_k(\ib_{\pi(k)})
\end{align}
where $\pi$ is a so-called alignment function from $[0\,..\,n]$ to non-overlapping subsets of $[1\,..\,N]$.
Importantly, $\tau_0$ (and thus $(\bar{\tau}_0,\pi(0))$) is assumed fixed and known.
Additionally, $\omega_0$ is assumed to be the trivial constant map $\ib \mapsto 0$, to ensure that the high-level target variable cannot be directly intervened upon, as we want to explain the changes in $Y$ exclusively through changes of its high-level causes. 
\par
A high-level model involves the following mechanisms, which need to be learned:
(1) The marginal distribution of each high-level cause  $P^{(\jb)}({Z}_k)$ in all high-level interventional settings $\jb$. 
(2) The mechanism $P(Y|\Zb)$ mapping high-level causes to $Y$, comprised of the distribution of the exogenous variable $R_0$ and the structural equation
\[
(Z_1,\dots,Z_n, R_0) \mapsto  f_{\gammab} (Z_1,\dots,Z_n, R_0) \eqqcolon Y\, .
\]

\subsection{Causal consistency loss}

It is not always possible to achieve an exact transformation that guarantees consistency of low- and high-level models for almost all interventions.
As a consequence, we allow for the consistency between models to be approximate. 
To ensure that this approximation is as accurate as possible, we minimize the expected KL divergence between the pushforward by the transformation $\tau$ of the low-level interventional distributions that we denote $
\widehat{P}_{\tau}^{(\ib)}(Y,\Zb) = \tau_{\#}[P_{\Lcal}^{(\ib)}(\Xb)]$, and the corresponding interventional distribution of the high-level model $P^{(\omega(\ib))}$, leading to the consistency loss
\begin{equation}\label{eq:lcons}
    \Lcal_\mathrm{cons} =	 \EE_{\ib\sim P(\ib)} \left[ \mathrm{KL}\left( \widehat{P}_{\tau}^{(\ib)}(Y,\Zb)\|P^{(\omega(\ib))}(Y,{\Zb})\right)\right]\,.
\end{equation}
Other losses have been previously suggested to enforce consistency.
\citet{beckers2020approximate} propose to take a maximum over interventions, whereas we take the expectation in our loss, thus focusing the CMR on the average performance rather than the worst case.
\citet{rischel2021compositional} and \citet{zennaro2023jointly} use the Jensen-Shannon (JS) divergence in the context of finite models.
Instead, we choose the KL divergence because, contrary to JS, it leads to a tractable expression under Gaussian assumptions. 
Moreover, the proposed consistency loss~\eqref{eq:lcons} has the following properties. 

\begin{restatable}[Consistency loss]{proposition}{consloss}\label{prop:consloss}
The consistency loss is non-negative, invariant to invertible reparametrizations (see Def.~\ref{def:reparam}), and vanishes if and only if the transformation is exact for almost all interventions. It decomposes as 
\begin{multline}
\label{eq:CMdec}
   \Lcal_\mathrm{cons}\!
    =\!\EE_{\ib\sim P(\ib)}\!\! \left[
    \mathrm{KL}\left(
       \widehat{P}_{\tau}^{(\ib)}\left(\Zb\right)\|{P}^{(\omega(\ib))}\left(\Zb\right)
    \right)\right. \\\!
    \left.+\EE_{\zb\sim \widehat{P}_{\tau}^{(\ib)}\left(\Zb\right)}\!\!\left[
        \mathrm{KL}\left(
            \widehat{P}_{\tau}^{(\ib)}\left({Y}|\zb\right)\|{P}^{(0)}\left({Y}|\zb\right)
        \right)
    \right]\right]\,,
\end{multline}
and is an upper bound of the \emph{causal relevance loss}
\begin{equation}\label{eq:Lrel}
    \Lcal_\mathrm{rel} \!=\! \EE_{\ib\sim P(\ib)}\!\left[\mathrm{KL}\left(        \widehat{P}_{\tau}^{(\ib)}\left(Y\right)\|{P}^{(\omega(\ib))}\left(Y\right)\right)\right]\!\!\leq\!\! \Lcal_\mathrm{cons} \,. 
\end{equation}
\end{restatable}
\par
Reparametrization invariance (see Def.~\ref{def:reparam}) refers to transformations of the pairs $(\tau, f_{\gammab})$ that leave the composition $f_{\gammab}\circ \tau$ invariant. 
In the $n=1$ linear setting (see Sec.~\ref{sub:linear_reduction_with_shift_interventions}), this corresponds to invariance by multiplicative rescaling.
This guarantees that equivalent high-level causal descriptions are treated equally by the loss. 
\par
\looseness -1
We call Eq.~\eqref{eq:CMdec} a \textit{Cause-Mechanism Decomposition} because the first term quantifies the \textit{cause consistency} and the second term can be thought of as the \textit{mechanism consistency}. 
This latter term assesses the similarity between the outputs of the learned high-level mechanism $P^{(0)}(Y|\zb)$ and the corresponding conditional distribution computed by push-forward of the low-level variables $\widehat{P}_{\tau}^{(\ib)}(Y|\zb)$. 
Since we prevent the high-level mechanism from being intervened on, only its unintervened conditional appears in the expression. 
\par
Lastly, the causal relevance loss $\Lcal_\mathrm{rel}$ assesses whether the variations of the target $Y$ due to low-level interventions are well-captured by high-level interventions, on average over the prior $P(\ib)$.
Because it is upper bounded by $\Lcal_\mathrm{cons}$, optimizing for consistency also indirectly promotes effective ``explanations'' of the variations in the target density resulting from low-level interventions. 
We can thus choose $P(\ib)$ to make the most relevant interventions more likely according to domain knowledge, such that optimizing the loss will steer towards a solution capturing the most domain-relevant variations of the target. 
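\par
To make the tractability of the KL divergence under Gaussian assumptions concrete, recall the standard closed form for univariate Gaussians (the symbols $\mu_1,\mu_2,\sigma_1,\sigma_2$ are generic placeholders, not quantities defined above):
\[
\mathrm{KL}\left(\Ncal(\mu_1,\sigma_1^2)\,\|\,\Ncal(\mu_2,\sigma_2^2)\right) = \log\frac{\sigma_2}{\sigma_1} + \frac{\sigma_1^2 + (\mu_1-\mu_2)^2}{2\sigma_2^2} - \frac{1}{2}\,.
\]
For equal variances $\sigma_1=\sigma_2=\sigma$ this reduces to $(\mu_1-\mu_2)^2/(2\sigma^2)$, so a mismatch between the pushed-forward and the high-level interventional means is penalized quadratically.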

\subsection{Linear reduction with shift interventions} 
\label{sub:linear_reduction_with_shift_interventions}
We further constrain the setting to be able to study the solution minimizing $\Lcal_\mathrm{cons}$ analytically and get insights into the properties of TCR. 
 
\paragraph{Notation.}
When a vector, say $\taub_k$, is associated to a high-level SCM component $k$ of a constructive transformation with alignment $\pi$, $\bar{\taub}_k$ indicates the restriction of $\taub_k$ to components in $\pi(k)$.
The number of elements in a set $S$ is $\#S$.

\paragraph{Tau map.}
To maximize interpretability, we assume a linear $\tau$-map, represented by vectors $\taub_0,\dots,\taub_n$ such that:
\[
\Xb \mapsto \begin{bmatrix}
Y\\
{\Zb}
\end{bmatrix} =
\begin{bmatrix}
\taub_0^\top\\
\vdots\\
\taub_n^\top
\end{bmatrix} \Xb
=
\begin{bmatrix}
\bar{\taub}_0^\top \Xb_{\pi(0)}\,, & \dots\,, &
 \bar{\taub}_n^\top  \Xb_{\pi(n)}
\end{bmatrix}^\top .
\]


\paragraph{Omega map.}

We focus on \textit{shift interventions} and map the vector $\boldsymbol{i}$ of low-level interventions on the nodes in $\pi(k)$ to a scalar shift intervention on the mechanism of each $Z_k$.
We assume each map $\omega_k$ to be linear with vector $\omegab_k$ such that 
\[
\omega_k(\ib) = \omegab_k^\top \boldsymbol{i}=\bar{\omegab}_k^\top  \boldsymbol{i}_{\pi(k)}\, .
\]
Because high-level causes are root nodes, intervening amounts to shifting the marginal distribution
from $P^{(0)}(Z_k)$ to  $P^{(\omega_k(\ib))}(Z_k)=P^{(0)}(Z_k-\omega_k(\ib))$. 

\paragraph{Choice of alignment $\pi$.}
There are potential degrees of freedom for $\pi$, and users may want to incorporate domain knowledge as well as interpretability constraints to reduce the variables included in $\cup_{k\neq 0} \pi(k)$.
In practice, we learn how the low-level variables are distributed among the sets $\pi(k)$ using regularization (see Sec.~\ref{sec:linear_tcr_algorithm}).

\paragraph{High-level mechanism.}
We use an interpretable affine high-level causal mechanism $f_{\gammab}$, such that
\begin{equation}\label{eq:linhighlevelmech}
    \textstyle
    Y\coloneqq \sum_{k=1}^{n} \alpha_k Z_k +R_0+\beta\,,\quad \alpha_1,\dots,\alpha_n,\beta \in \RR \, .
\end{equation}



\paragraph{Choice of prior $P(\ib)$.}
The solutions minimizing the loss of Eq.~(\ref{eq:lcons}) may depend on the choice of the prior $P(\ib)$, and in particular on which variables are actually intervened on. 
Let ${\Omega}$ denote the subset of indices of low-level variables that are intervened on with non-zero probability. 
The components of $\ib$ whose index does not belong to $\Omega$ thus take value $i=0$ with probability one.
We provide identifiability guarantees under either of two assumptions. 

\begin{assumption}\label{assum:prior}
 $P(\ib_{{\Omega}})$ has a density with respect to the Lebesgue measure, with support covering a neighborhood of zero (\textit{i.e.}\ the unintervened case).
\end{assumption}
\begin{assumption}\label{assum:prior2} The unintervened setting $\ib_\Omega=\boldsymbol{0}$ occurs with non-zero probability. Additionally, there are at least $\#\Omega$ distinct interventions happening with non-zero probability, corresponding to a family of values of the vector $\ib_{\Omega}$ with full rank $\#\Omega$.  
\end{assumption}
While Assum.~\ref{assum:prior} depicts a practical setting where interventions are drawn from prior densities that reflect the prior knowledge on how likely those are, Assum.~\ref{assum:prior2} allows addressing a classical question in causal representation learning: \textit{How many distinct interventions are needed to learn the representation?}

\subsection{Identifiability results}\label{sec:ident}
If we assume the low-level model is linear Gaussian of the form $ \Xb_{\Omega}\to \Xb_{\pi(0)}$, we can show the existence and uniqueness of the solution.


\begin{restatable}{proposition}{anasol} \label{prop:analytic_solution}
Assume
 the low-level SCM follows 
\[
\Xb\coloneqq A \Xb+\Ub +\ib \,,\,\,\ib\sim P(\ib)\,,\quad U_k \sim \Ncal(\mu_k,\sigma_k^2)\,,\sigma_k>0\,,
\]
such that $\Xb$ and $A$  take the block forms
\[
\Xb = \begin{bmatrix}
\Xb_{\pi(0)}\\ \Xb_{\Omega}
\end{bmatrix}\,,\quad 
A=
\left[
\begin{matrix}
A_{00}\, & A_{0\Omega}\\
\boldsymbol{0}\, & A_{\Omega \Omega}
\end{matrix}
\right]\,%, quad
%\Ub = \begin{bmatrix}
%\Ub_{\pi(0)}\\ \Ub_{\Omega}
%\end{bmatrix}\,,\quad
.
\]
Given an arbitrary choice of linear scalar target of the form $Y=\taub_0^\top \Xb=\bar{\taub}_0^\top \Xb_{\pi(0)}$ and under Assum.~\ref{assum:prior} or Assum.~\ref{assum:prior2}, % if  $\Omega=\pi(1)$, 
there is a unique linear 1-cause TCR (up to a multiplicative constant) satisfying $\Lcal_\mathrm{cons}=0$. It is given by 
\begin{align}
\pi(1) =&\, \Omega\,,\\
\bar{\taub}_1 =&\, A_{0\Omega}^\top (I_{\#\pi(0)}-A_{00})^{-\top}\bar{\taub}_0 \label{eqn:analyitcal_tau}\,,\\
\text{and} \quad \bar{\omegab}_1 =&\,  (I_{\#\Omega}-A_{\Omega \Omega})^{-\top}\bar{\taub}_1\,. \label{eqn:analyitcal_omega}
\end{align}
Moreover, let $n_{\max}$ be the maximum number $n$ such that a linear $n$-cause TCR can achieve $\Lcal_\mathrm{cons}=0$.
If there are no cancellations%
\footnote{Causal pathways cancel if the linear coefficients quantifying the influence of a node on $Y$ along different directed paths of the low-level SCM sum to zero. Assuming no cancellations is akin to assuming no faithfulness violations and is generically satisfied \citep[Theorem 3.2]{sprites2001causation}.}
among causal pathways from each node in $\mbox{supp}(\bar{\omegab}_1)$ of Eq.~(\ref{eqn:analyitcal_omega}) towards $Y$, then the $n_{\max}$-cause TCR is unique up to rescaling and permutation of the causes. 
\end{restatable}
This result provides guarantees for having a unique ground-truth solution in case exact transformations can be achieved.
The main assumption is the absence of feedback influences from the target set $\pi(0)$ to candidate causes.
However, cycles and confounding are allowed in the low-level model, contrary to the learned high-level model.
The 1-cause solution is easiest to obtain.
The study of simple SCMs (App.~\ref{app:singtar} and App.~\ref{app:linchain}) provides some insights on the form of the analytical solution.
Additional results show that we lose identifiability of the TCR if not all variables in $\pi(1)$ are intervened on (see App.~\ref{app:nointernoident}). 
The $n$-cause solution is essentially a partition of the 1-cause solution that enforces independence between the resulting causes.
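\par
The form of the 1-cause solution can be motivated by the following sketch, which assumes that $I-A_{00}$ and $I-A_{\Omega\Omega}$ are invertible, that no intervention acts on $\Xb_{\pi(0)}$, and that the components of $\Ub$ are independent; the complete argument, including uniqueness, is deferred to App.~\ref{app:proofs}.
Solving the block system of Prop.~\ref{prop:analytic_solution} gives
\begin{align*}
\Xb_{\Omega} &= (I-A_{\Omega\Omega})^{-1}(\Ub_{\Omega}+\ib_{\Omega})\,,\\
\Xb_{\pi(0)} &= (I-A_{00})^{-1}(A_{0\Omega}\Xb_{\Omega}+\Ub_{\pi(0)})\,,
\end{align*}
so that
\[
Y = \bar{\taub}_0^\top \Xb_{\pi(0)} = \bar{\taub}_1^\top \Xb_{\Omega} + \bar{\taub}_0^\top (I-A_{00})^{-1}\Ub_{\pi(0)}\,,
\]
with $\bar{\taub}_1$ as in Eq.~\eqref{eqn:analyitcal_tau}.
Setting $Z_1 = \bar{\taub}_1^\top \Xb_{\Omega}$, the second term plays the role of $R_0+\beta$ in Eq.~\eqref{eq:linhighlevelmech} with $\alpha_1=1$, and the intervention $\ib$ shifts $Z_1$ by $\bar{\taub}_1^\top (I-A_{\Omega\Omega})^{-1}\ib_{\Omega} = \bar{\omegab}_1^\top \ib_{\Omega}$, which matches Eq.~\eqref{eqn:analyitcal_omega}.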
\par
Under the same model assumptions, the resulting constructive transformation can be associated with a constructive causal abstraction, as shown in Proposition~\ref{prop:abs_solution}.
This corresponds to a particular case of low soft abstraction introduced by \citet[Def.~9]{massidda2023causal}.
\par
\begin{restatable}[Linear chain]{example}{linear_chain} \label{ex:linear_chain}
To illustrate the solutions in Prop.~\ref{prop:analytic_solution}, we consider a linear chain
\begin{equation*}
    \underbrace{X_1 \rightarrow X_2 \rightarrow X_3}_{X_\Omega} \rightarrow \underbrace{X_4}_{X_{\pi(0)}}
\end{equation*}
with adjacency $A_{ij} = \{ 1\ \mathrm{for}\ j {=} i{+}1;\ 0 \ \mathrm{else}\}$ and target $Y=X_4$, such that $\bar{\taub}_0 = I_1$.
The 1-cause solution (up to a multiplicative constant) achieving $\Lcal_\mathrm{cons}=0$ is
\begin{equation}
    \bar{\taub}_1 = \begin{pmatrix}
        0 & 0 & 1
    \end{pmatrix}^\top
    \quad \text{and} \quad
    \bar{\omegab}_1 = \begin{pmatrix}
        1 & 1 & 1
    \end{pmatrix}^\top \, .
\end{equation}
$\bar{\taub}_1$ puts all its weight on $X_3$, the direct parent of the target $X_4$, because it mediates all causal influences.
In contrast, $\bar{\omegab}_1$ puts weight on all variables in $\Omega$ because interventions on any of them influence $X_4$.
\end{restatable}
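\par
As a sanity check on Ex.~\ref{ex:linear_chain}, the form of $\bar{\omegab}_1$ can be recovered by propagating shifts through the chain by hand.
The sketch below assumes that a shift intervention adds a constant $i_j$ to the structural equation of $X_j$ for $j \in \Omega$, and writes $U_j$ for the exogenous noise of $X_j$; both symbols are introduced here for illustration only.
\begin{align*}
    X_1 &= U_1 + i_1, \\
    X_2 &= X_1 + U_2 + i_2, \\
    X_3 &= X_2 + U_3 + i_3, \\
    X_4 &= X_3 + U_4 = (i_1 + i_2 + i_3) + (U_1 + U_2 + U_3 + U_4).
\end{align*}
Under this assumption, the target is shifted by $i_1 + i_2 + i_3$, so every shift enters with the same unit weight, in line with the equal entries of $\bar{\omegab}_1$.
At the same time, $X_4$ depends on $X_\Omega$ only through $X_3$, in line with $\bar{\taub}_1$ having a single non-zero entry.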

\section{LINEAR TCR ALGORITHM}
\label{sec:linear_tcr_algorithm}

In this section, we introduce an algorithm to learn a linear targeted causal reduction with shift interventions. 
\makeatletter
% Reinsert missing \algbackskip
\def\algbackskip{\hskip-\ALG@thistlm}
\makeatother

\begin{algorithm}%[hbt!]
\caption{Linear TCR (LTCR)%, %fixed simulation parameter.
}\label{alg:LCPR}
% \hspace*{\algorithmicindent} 
\textbf{Input} $\lambda$: learning rate, $P(\ib)$: intervention prior, \textit{Simulate}$(\thetab,\ib,n_\mathrm{sim})$: function returning $n_\mathrm{sim}$ paths, $N_\mathrm{ite}$: No.\ epochs, $B$, $B_\ib$: simulation/intervention batch size. \\
    \textbf{Initialize} $\taub_1,\omegab_1, \gammab$
\begin{algorithmic}
\FOR{ $m = 1, \dots, N_\mathrm{ite}$}
    \STATE $X,Y,I\gets []$
    \FOR{ $l = 1, \dots, B_\ib$}
        \STATE $\ib_l \gets Sample(P(\ib))$
        \STATE $X_l=(\xb^1,\dots, \xb^B),\ Y_l \gets Simulate(\thetab,\ib_l,B) $
        \STATE $X\gets [X[:],X_l]$;  $Y\gets [Y[:],Y_l]$; $I\gets [I[:],\ib_l]$
        % \STATE $Y\gets [Y[:],Y_l]$
%        \STATE $\mu_X\gets [\mu_X[:],Mean(X_l)]$
        % \STATE $I\gets [I[:],\ib_l]$
    \ENDFOR 
\STATE $L_\mathrm{tot}\gets ComputeLoss(X,Y,I,\taub_1,\omegab_1,\gammab)$
\STATE $\nabla_{\gammab}, \nabla_{\taub_1}, \nabla_{\omegab_1}\gets ComputeLossGradient(L_\mathrm{tot})$
\STATE $(\gammab,\taub_1,\omegab_1) \gets (\gammab-\lambda \nabla_{\gammab},\taub_1-\lambda \nabla_{\taub_1},\omegab_1-\lambda \nabla_{\omegab_1})$
\ENDFOR
\end{algorithmic}
\textbf{Output} Estimated parameters $(\taub_1, \omegab_1, \gammab)$.
\end{algorithm}

\begin{figure}[!b]
    \centering
    \begin{subfigure}[t]{\linewidth}
        \includegraphics[width=\linewidth]{figures/linear_loss.pdf}
        \caption{
            \small 
            \textbf{Comparison between learned and analytical 1-cause TCR solutions.}
            Average cosine similarity to the analytical solutions over 20 runs.
            Each run corresponds to one draw of adjacency matrix parameters.
            The shaded areas show the range between the minimum and maximum values.
            The dashed gray line corresponds to perfect similarity.
        }
        \label{fig:linear_loss}
    \end{subfigure}
    % \hfill % Optional: add some horizontal space
    \begin{subfigure}[t]{\linewidth}
        \vspace{\baselineskip}
        \includegraphics[width=\linewidth]{figures/two_branch_inset.pdf}
        \caption{\small \looseness-1
            \textbf{Two-branch linear model.}
            Learned $\tau$- and $\omega$-parameters for a TCR with two high-level variables for a linear Gaussian low-level model with $N{=}10$.
            The solid lines show the parameters for $Z_1$ and the dashed lines those for $Z_2$.
            The parameters are averaged over 20 runs where each run corresponds to one draw of adjacency matrix parameters.
            The inlay shows the causal structure of the low-level model, where two groups of variables $G_1$ and $G_2$ form two independent chains causing the target $X_{10}{=}Y$.
        }
        \label{fig:two_branch}
    \end{subfigure}
    \caption{\small \looseness-1
            \textbf{Toy example experiments.}
            }
    \label{fig:toy_experiments}
\end{figure}

\paragraph{Gaussian approximation of consistency loss.}
Since the KL divergence is challenging to compute in non-parametric settings, we make a Gaussian assumption on the densities. 
This allows us to obtain an analytic expression for the loss based on second order statistics (see expression in App.~\ref{app:gaussloss}).
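\par
To illustrate why second-order statistics suffice under this assumption, recall the standard closed form of the KL divergence between two univariate Gaussians.
The generic densities $p$ and $q$ and their parameters are notation introduced here for illustration; the expression actually used for the loss, including which distributions take these roles, is the one given in App.~\ref{app:gaussloss}.
\begin{equation*}
    \mathrm{KL}\bigl(\mathcal{N}(\mu_p,\sigma_p^2)\,\|\,\mathcal{N}(\mu_q,\sigma_q^2)\bigr)
    = \log\frac{\sigma_q}{\sigma_p} + \frac{\sigma_p^2 + (\mu_p-\mu_q)^2}{2\sigma_q^2} - \frac{1}{2}\,.
\end{equation*}
The right-hand side depends only on means and variances, which can be estimated from simulated samples, and it vanishes exactly when $\mu_p = \mu_q$ and $\sigma_p = \sigma_q$.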

\begin{figure*}[htb]
\centering
    \includegraphics[width=\linewidth]{figures/double_well_phase2.pdf}
	\caption{\small
	\textbf{Double well experiment.}
    % Comparison of samples and inferred causal factors in the double well experiment. 
    (a) Experimental setup with a ball moving in a double well potential subject to linear friction.
    (b) Pushforward density of the high-level cause for the two settings: one where no intervention is applied (unintervened), and the other with an applied shift intervention.
    (c, d) Learned parameters, $\tau$ and $\omega$, respectively.
    The learned high-level mechanism is $f(Z_1) \approx 1.37 Z_1 + 0.45$.
    (e) Samples in phase space (position vs.\ velocity) for the first 20 time points.
    The color indicates whether the high-level model predicts the ball to end up in the right (pink) or left well (turquoise).
    (f, g) Samples from the unintervened setting and the corresponding estimated density. 
    (h) Estimated density for one intervened setting.
	\label{fig:double_well}
 }
\end{figure*}

\paragraph{Overlap loss.}  \looseness-1
To ensure differentiability of the reduction maps, we do not implement the alignment $\pi$ explicitly, but encourage non-overlapping reduction maps via the regularizer
\begin{equation}
    \Lcal_\mathrm{ov}=\sum_{k<l}\left(
    \left\langle \frac{|\taub_k|}{\|\taub_k\|},\frac{|\taub_l|}{\|\taub_l\|}\right\rangle
    + \left\langle \frac{|\omegab_k|}{\|\omegab_k\|},\frac{|\omegab_l|}{\|\omegab_l\|}\right\rangle
    \right),
    \label{eqn:overlap}
\end{equation}
where $|\cdot|$ is the element-wise absolute value.
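\par
A small worked case may help to see what each term of \eqref{eqn:overlap} measures and why the absolute value is taken.
The vectors below are hypothetical two-dimensional examples, and $\|\cdot\|$ is taken to be the Euclidean norm, which is an assumption made only for this illustration.
\begin{align*}
    \taub_k = (1, 0)^\top,\ \taub_l = (0, -1)^\top
    &\ \Rightarrow\
    \left\langle \frac{|\taub_k|}{\|\taub_k\|},\frac{|\taub_l|}{\|\taub_l\|}\right\rangle
    = \bigl\langle (1,0)^\top, (0,1)^\top \bigr\rangle = 0, \\
    \taub_k = (1, 1)^\top,\ \taub_l = (1, -1)^\top
    &\ \Rightarrow\
    \left\langle \frac{|\taub_k|}{\|\taub_k\|},\frac{|\taub_l|}{\|\taub_l\|}\right\rangle
    = \tfrac{1}{2}\bigl\langle (1,1)^\top, (1,1)^\top \bigr\rangle = 1.
\end{align*}
In the first case, the supports are disjoint and the term vanishes.
In the second case, the two maps are orthogonal in the usual sense, $\langle \taub_k, \taub_l \rangle = 0$, yet they use the same low-level variables; the element-wise absolute value removes the sign cancellation, so the term attains its maximal value of one.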

\paragraph{Balancing loss.}
When minimizing the Gaussian approximation of the consistency loss together with the overlap regularizer~\eqref{eqn:overlap}, nothing prevents the solution from attributing all non-zero weights in the $\tau$- and $\omega$-maps to one high-level variable while ignoring all others.
In order to prevent such a collapse, we minimize stark differences between the high-level variables through the balancing term
\begin{equation}
    \Lcal_\mathrm{bal}=\Biggl( \
        \frac{\sqrt{\sum_k\| \alpha_k \taub_k \|_2^2}}{\sum_k\| \alpha_k \taub_k \|_2}
        + \frac{\sqrt{\sum_k\| \alpha_k \omegab_k \|_2^2}}{\sum_k\| \alpha_k \omegab_k \|_2}
    \ \Biggr) \, ,
    \label{eqn:balancing}
\end{equation}
where $\alpha_k$ is the coefficient in the linear high-level mechanism corresponding to variable $Z_k$.
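\par
To see how \eqref{eqn:balancing} penalizes collapse, it may help to evaluate one of its two ratios in the extreme cases.
Write $a_k = \|\alpha_k \taub_k\|_2 \geq 0$ for $k = 1, \dots, n$, where the shorthand $a_k$ is introduced here for illustration and not all $a_k$ are assumed to be zero.
\begin{align*}
    \text{balanced, } a_1 = \dots = a_n = a > 0:&\quad
    \frac{\sqrt{\sum_k a_k^2}}{\sum_k a_k} = \frac{\sqrt{n}\,a}{n\,a} = \frac{1}{\sqrt{n}}, \\
    \text{collapsed, } a_1 > 0,\ a_2 = \dots = a_n = 0:&\quad
    \frac{\sqrt{\sum_k a_k^2}}{\sum_k a_k} = \frac{a_1}{a_1} = 1.
\end{align*}
Since $\tfrac{1}{\sqrt{n}}\sum_k a_k \leq \sqrt{\sum_k a_k^2} \leq \sum_k a_k$ for non-negative $a_k$, each ratio lies in $[1/\sqrt{n}, 1]$, so that $\Lcal_\mathrm{bal} \in [2/\sqrt{n}, 2]$.
The lower bound is attained when all high-level variables contribute equally and the upper bound when a single variable carries all the weight; for $n = 2$, the ratio ranges from $1/\sqrt{2} \approx 0.71$ to $1$.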


Gathering the losses, we get the total objective
\begin{equation}
\underset{\gammab,\taub,\omegab}{\mbox{minimize }} \Lcal_\mathrm{tot} = \Lcal_\mathrm{cons}+\eta_\mathrm{ov}\Lcal_\mathrm{ov} +\eta_\mathrm{bal}\Lcal_\mathrm{bal}\,.
\label{eqn:total_loss}
\end{equation}
The learning procedure is described in Algorithm~\ref{alg:LCPR}.


\section{EXPERIMENTS}
\label{sec:experiments}

\subsection{Toy examples: linear Gaussian low-level causal models}
\label{sub:exp_toy_examples}


\begin{figure*}[htb]
\centering
    \includegraphics[width=\linewidth]{figures/mass_spring3.pdf}
	\caption{\small
	\textbf{Spring-mass system experiment.}
        (a) Simulated system of four point masses with different weights connected by springs and with random initial velocity (blue arrows).
        The target of the simulation is the center of mass speed in $(1, 1)$-direction.
        (b-e) Learned $\taub$- and $\omegab$-weights corresponding to velocity components in $x$- and $y$-direction for a TCR with two high-level variables.
        The learned high-level mechanism is $f(\Zb) \approx -0.226 Z_1 + 0.220 Z_2 $.
        (f) Comparison between masses and learned omega weights. For the first high-level variable the mean omega weights corresponding to the $x$-direction are shown and for the second variable those for the $y$-direction.
        (g-j) Example trajectory for an unintervened system.
        }
    \label{fig:spring_mass}
\end{figure*}

\paragraph{Linear low-level causal models.}
We first test TCR by sampling from a linear Gaussian low-level model, rather than a simulation.
We construct linear models of the form shown in Prop.~\ref{prop:analytic_solution} by drawing the non-zero entries in the adjacency matrix uniformly from the interval $[-1, 1]$.
We learn a targeted causal reduction with two high-level variables: the target $Y$ and its single cause $Z$.
Fig.~\ref{fig:linear_loss} compares the learned $\taub_1$ and $\omegab_1$ to the analytical solutions \eqref{eqn:analyitcal_tau} and \eqref{eqn:analyitcal_omega}.
We observe that, for these low-level models meeting the linear Gaussian assumption in Section~\ref{sec:theoretical_analysis}, the learning algorithm converges to the global optimum.

\paragraph{Two-branch model.}  \looseness-1
To investigate the behavior of TCR with multiple high-level variables, we consider a low-level model with a two-branch causal structure (Fig.~\ref{fig:two_branch}).
With regularization for overlap~\eqref{eqn:overlap} and balancing~\eqref{eqn:balancing}, the learned high-level variables correspond to the two branches.
Within each branch, the reduction behaves as described for the linear chain in Ex.~\ref{ex:linear_chain}, where $\tau$ focusses on the direct parent of the target and $\omega$ is spread across all variables in the chain.
Comprehensive experimental details are given in App.~\ref{app:experimental_details}.


\subsection{Double Well}
\looseness-1
For a simulation based on an ODE system, we learn a targeted reduction of a ball moving in a double well potential under linear friction, as shown in Fig.~\ref{fig:double_well}.
The state vector $\Xb$ encodes the $x$-position and the velocity in $x$-direction of the ball at each time step of the simulation.
As shift-interventions, we apply small random shifts of the ball's velocity at each simulation time step, mimicking an applied external force.
Initially, the ball starts on the left-hand side of the potential and starts oscillating. 
Since the ball experiences friction, it ends up in either the left or right minimum of the potential.
The friction is relatively strong, such that, depending on the initial conditions and applied shift interventions, the ball either stays in the left well or crosses the middle hump once and stays in the right well (see Fig.~\ref{fig:double_well}(f)).
We learn a simple TCR with a single cause $Z$ that explains the target $Y$.
Further details about the nonlinear ODE system and training are given in App.~\ref{app:double_well}.
\par \looseness-1
The learned TCR parameters are shown in Fig.~\ref{fig:double_well}(c, d).
The $\taub_1$ and $\omegab_1$ parameters for velocity are such that the larger the velocity is to the right, the higher $Z$ and therefore the higher the predicted target $Y$, where positive values of $Y$ correspond to the right well and negative values to the left well.
Similarly, for the position parameter: the more negative the position just before the critical point of the ball crossing the hump, the higher the probability of predicting to stay in the left well.
This corresponds to the correct dynamics of the system and also identifies the main drivers that influence the outcome $Y$.
Fig.~\ref{fig:double_well}(e) shows how the learned TCR separates the phase space into simulations with enough momentum to the right to make it over the hump (pink) and those without (turquoise).
\par \looseness-1
Note that TCR does not focus on the part of $\Xb$ which best predicts the final state of the system---like the position just before the end of the simulation.
It rather highlights the variables which have the most impact on the target when they are intervened on, emphasizing the decisive time span when the ball either crosses the middle hump or stays in the left well.

\subsection{Spring-Mass System}
\label{ssec:spring_mass_system}

We simulate a two-dimensional system of four point masses with different weights connected by springs to their respective nearest neighbors, similar to the motivating example introduced in Sec.~\ref{sec:introdcution}.
Initially, the masses are arranged in a rectangle in space such that the springs are at rest length.
The masses have a random initial velocity, as shown in Fig.~\ref{fig:spring_mass}(a).
As interventions, we apply random shifts to the velocities in $x$- and $y$-direction of each mass.
The target of the simulation is the center of mass speed in $(1,1)$-direction.
We learn a TCR with two high-level causes.
The full experimental details are given in App.~\ref{app:spring_mass_system}.
\par \looseness-1
While the velocities of the individual masses are coupled, the center of mass velocities in $x$- and $y$-direction of the system as a whole are independent, since the system is freely moving in space.
The learned TCR correctly identifies these as the two independent causes of the target, with variable $Z_1$ corresponding to the $y$-direction and $Z_2$ to the $x$-direction.
On average, each mass receives a similar shift in velocity through the applied interventions.
However, since the masses are different, the shifts correspond to different contributions to the momentum of the system as a whole impacting the target.
This is reflected in the relative weights of the learned maps being proportional to the mass of each point mass, as shown in Fig.~\ref{fig:spring_mass}(f).
\par
A second experimental setting with two groups of interconnected masses is shown in App.~\ref{app:spring_mass_grouped}, demonstrating a TCR learning independent causes along the mass index.


\section{DISCUSSION}

We introduce a novel approach for understanding complex simulations by learning high-level causal explanations from low-level models.
Our Targeted Causal Reduction (TCR) framework leverages interventions to obtain simplified, high-level representations of the causes of a target phenomenon. 
We formulate the intervention-based consistency constraint as an information theoretic learning objective, which favors the most causally relevant explanations of the target. 
Under linearity and Gaussianity assumptions, we provide analytical solutions and study their uniqueness, which provides insights into TCR's governing principles. 
One key assumption to obtain identifiability is that the leaf node, the target, is observed.
However, this is to the best of our knowledge the first identifiability proof for a general class of CMR for which the high-level variables are continuous and partially unknown. 
Notably, the $n$-cause TCR provides a form of causal \textit{independent component analysis} akin to the work of \citet{liang2023causal}, but in a non-invertible setting and with a one-dimensional target.
We provide an algorithm for linear TCR and show it can effectively uncover the key causal factors influencing a phenomenon of interest. 
We demonstrate TCR on both synthetic models and scientific simulations, highlighting its potential for addressing the challenges posed by increasingly complex systems in scientific research. 

While we develop a CMR framework to learn high-level explanations for simulations, the simulation itself does not have to be explicitly formulated as a causal model and 
the causal relationships between variables in $\Xb$ do not have to be known a priori. 
The only additional element needed to learn TCR is a notion of shift-interventions.
We think that most scientific simulations based on differential equations naturally allow for a reasonable notion of shift interventions. 

\paragraph{Limitations and future work.}
To foster interpretability and tractability, we made Gaussian approximations and used linear $\taub$ and $\omegab$ maps.
While this has clear benefits, it may be too limiting for some complex simulations, and future work should explore more flexible approaches.
Additionally, our method relies on performing a large number of interventions in simulation runs, which represents an additional cost in the context of large-scale simulation.
How to make the algorithm scale to this setting is left to future work. 

\begin{acknowledgements}
We thank Sergio Hernan Garrido Mejia and Yuchen Zhu for insightful discussions.
\par
This publication was supported by the German Federal Ministry of Education and Research (BMBF) through the Tübingen AI Center (FKZ 01IS18039A) and by the German Research Foundation (Deutsche Forschungsgemeinschaft, DFG) through the Machine Learning Cluster of Excellence (EXC 2064/1, project 390727645).
\end{acknowledgements}

{\small
\bibliography{references}
}
\input{supplement}
\end{document}