\title{Supplementary Material}
\maketitle

\appendix

\tableofcontents


\onecolumn

\section{Further methodological details}
We provide further details and intuition about our approach, starting with its connection to information theory.
\subsection{An information theory perspective on nested models test maximisation}\label{sec:IT_and_SNR}

In an information-theoretic framework~\citep{thomas2006elements}, the nested-model residuals in \eqref{eq:simple_population_loss} can be interpreted as measures of uncertainty, quantified by entropy, regarding \( Y \) given \( X \) and \( Z \). The proposed statistic then aims to maximise the conditional mutual information
\[
I(X; \mathbf{w}^\top Y \mid Z) = H(\mathbf{w}^\top Y \mid Z) - H(\mathbf{w}^\top Y \mid Z, X),
\]
where \( H(\mathbf{w}^\top Y \mid Z) \) and \( H(\mathbf{w}^\top Y \mid Z, X) \) denote the corresponding conditional entropies. This aligns with the well-established connection between conditional mutual information, causality, and conditional independence~\citep{janzing2013quantifying}.  

Moreover, Proposition~\ref{prop:SNR_FI_equiv} establishes that, in the linear case with information bottleneck, maximising the signal-to-noise ratio (SNR) is equivalent to maximising the Fisher information. This implies that the proposed algorithm optimally distinguishes between the interventional distributions \( p(Y \mid do(X = x)) \) and \( p(Y \mid do(X = x + \delta x)) \), improving their separability under small interventions. This follows from the well-established connection between Fisher information and the Kullback–Leibler divergence, stated next.

\begin{proposition}
Let \( P(Y \mid x) \) be a probability distribution over \( Y \) parameterised by \( x \in \mathbb{R}^d \). Consider a small perturbation \( \delta x \) such that \( P(Y \mid x + \delta x) \) remains close to \( P(Y \mid x) \). Then, the Kullback–Leibler divergence between these two distributions admits the following second-order expansion:
\[
D_{\mathrm{KL}}(P(Y \mid x) \,\|\, P(Y \mid x + \delta x))
= \frac{1}{2} \delta x^\top I(x) \delta x + O(\|\delta x\|^3),
\]
where \( I(x) \) is the \emph{Fisher information matrix}. For the projected response \( \mathbf{w}^\top Y \), it is given by
\[
I_{\mathbf{w}}(x) = \mathbb{E} \left[ U(x) U(x)^\top \right],
\]
with \( U(x) =  \nabla_x \log P(\mathbf{w}^\top Y \mid  X= x) \) denoting the score function.
\end{proposition}

The proof is provided in Appendix~\ref{subsec:FI_SNR}.  

This result formalises the intuition that our algorithm identifies a subspace that maximally separates distributions under infinitesimal intervention perturbations, enhancing their distinguishability.
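
As a sanity check, the expansion can be verified directly in the Gaussian location case. The Gaussian assumption and the symbol \( J(x) \) for the Jacobian of the mean are introduced here only for illustration. Suppose \( P(Y \mid x) = \mathcal{N}(\mu(x), \mathbf{\Sigma}_N) \) with \( \mathbf{\Sigma}_N \) independent of \( x \), and write \( \mu(x + \delta x) = \mu(x) + J(x)\,\delta x + O(\|\delta x\|^2) \). The closed-form divergence between two Gaussians with equal covariance gives
\begin{align*}
D_{\mathrm{KL}}(P(Y \mid x) \,\|\, P(Y \mid x + \delta x))
&= \frac{1}{2} \left(\mu(x + \delta x) - \mu(x)\right)^\top \mathbf{\Sigma}_N^{-1} \left(\mu(x + \delta x) - \mu(x)\right) \\
&= \frac{1}{2}\, \delta x^\top J(x)^\top \mathbf{\Sigma}_N^{-1} J(x)\, \delta x + O(\|\delta x\|^3),
\end{align*}
and \( J(x)^\top \mathbf{\Sigma}_N^{-1} J(x) \) is the Fisher information of this family with respect to \( x \). When \( \mu \) is linear in \( x \), the remainder vanishes and the quadratic form is exact.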


\subsection{Partial correlation analysis as the maximisation of a conditional independence test statistic}\label{supp:pCCA}
We briefly outline how the partial Canonical Correlation Analysis (CCA) test, originally introduced by \citet{Rao1969}, can be interpreted within our framework. Specifically, we show that it can be viewed as the maximisation of a partial correlation test between $\mathbf{w}^\top Y$ and $X$ when adjusted for $Z$.

Let the population residuals after regressing out \( Z \) be defined as:
\begin{align*}
    R_x(\mathbf{v}) &= \mathbf{v}^\top X - \mathbb{E}[\mathbf{v}^\top X \mid Z], \\
    R_y(\mathbf{w}) &= \mathbf{w}^\top Y - \mathbb{E}[\mathbf{w}^\top Y \mid Z].
\end{align*}

Assuming a linear relationship between \( X \) and \( Y \), the conditional independence statistic can be expressed as:
\begin{align}\label{eq:population_loss_partialCCA}
    T_{\text{C}}(X, Y, Z; \mathbf{w}, \mathbf{v}) &= \text{artanh}(\text{corr}(R_x(\mathbf{v}), R_y(\mathbf{w}))).
\end{align}
Under the null hypothesis of conditional independence, and assuming that \( R_x \) and \( R_y \) are linearly related and follow a Gaussian distribution, it can be shown that \( T_{\text{C}} \) is asymptotically normally distributed. Since the \( \text{artanh} \) function is monotonically increasing, maximising the CIT statistic is equivalent to maximising the partial correlation test statistic.
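
For reference, consider the classical setting in which $\mathbf{w}$ and $\mathbf{v}$ are held fixed, the residuals are obtained by linear regression on an $r$-dimensional $Z$, and the data are jointly Gaussian. Fisher's transformation then yields the following approximation, stated here as a textbook fact rather than as a result specific to the optimised statistic, with $\hat{\rho}$ denoting the sample partial correlation and $\rho$ its population value:
\[
\sqrt{n - r - 3}\,\left(\text{artanh}(\hat{\rho}) - \text{artanh}(\rho)\right) \sim \mathcal{N}(0, 1) \quad \text{approximately}.
\]
Under the null hypothesis $\rho = 0$, so $\sqrt{n - r - 3}\,\text{artanh}(\hat{\rho})$ can be compared with standard normal quantiles. Once $\mathbf{w}$ and $\mathbf{v}$ are optimised on the same sample, this calibration no longer applies directly.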


It also shares similarities with the statistic proposed by \citet{shah2018}, the key difference being the normalisation. While \citet{shah2018} proposed a statistic based on the covariance of the residuals normalised by the variance of their product, we normalise by the product of the standard deviations of the residuals, resulting in a correlation coefficient. The approach of \citet{shah2018} is known to have power against alternatives under very weak assumptions, specifically that the convergence rates of the estimators of the conditional expectations yield an error product rate of $o(n^{-1})$. However, it is less straightforward to derive explicit formulations for $\mathbf{v}$ and $\mathbf{w}$ from their method.


\paragraph{Empirical estimator for partial CCA} 

We are given two unbiased estimators $\hat{f}_x(Z)$ and $\hat{f}_y(Z)$ of $\mathbb{E}[X|Z]$ and $\mathbb{E}[Y|Z]$, respectively. We denote by $ \hat{\mathbf{R}}_{x} $ and $ \hat{\mathbf{R}}_{y} $ the empirical residuals obtained from the predictions of $ \mathbf{X} $ and $ \mathbf{Y} $, that is, $\hat{\mathbf{R}}_{x} = \mathbf{X} - \hat{f}_x(\mathbf{Z})$ and $\hat{\mathbf{R}}_{y} = \mathbf{Y} - \hat{f}_y(\mathbf{Z})$. The maximisation of the loss can then be obtained via a generalised eigenvalue decomposition
\begin{equation}\label{eq:pcorr_empirical_loss}
\hat{\mathbf{\Sigma}}_{\mathbf{R}_x\mathbf{R}_y}^\top \hat{\mathbf{\Sigma}}_{\mathbf{R}_x} ^{-1}\hat{\mathbf{\Sigma}}_{\mathbf{R}_x \mathbf{R}_y} \mathbf{w} = \lambda     \hat{\mathbf{\Sigma}}_{\mathbf{R}_y}\mathbf{w},
\end{equation}
where $\hat{\mathbf{\Sigma}}_{\mathbf{R}_x \mathbf{R}_y}$ is the sample cross-covariance of $ \hat{\mathbf{R}}_{x} $ and $\hat{\mathbf{R}}_{y} $, and $\hat{\mathbf{\Sigma}}_{\mathbf{R}_x}$ and $\hat{\mathbf{\Sigma}}_{\mathbf{R}_y}$ are the sample covariances of $ \hat{\mathbf{R}}_{x} $ and $ \hat{\mathbf{R}}_{y} $, respectively.
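
To see where an eigenvalue problem of this type comes from, the following sketch recalls the standard CCA argument applied to the residuals. The shorthand $\mathbf{\Sigma}_{xx}$, $\mathbf{\Sigma}_{yy}$ and $\mathbf{\Sigma}_{xy}$ (introduced here for illustration) denotes the covariance of the $X$-residuals, the covariance of the $Y$-residuals, and their cross-covariance with rows indexed by $X$; both covariances are assumed invertible. Maximising the correlation amounts to
\[
\max_{\mathbf{v}, \mathbf{w}} \; \mathbf{v}^\top \mathbf{\Sigma}_{xy} \mathbf{w}
\quad \text{subject to} \quad
\mathbf{v}^\top \mathbf{\Sigma}_{xx} \mathbf{v} = \mathbf{w}^\top \mathbf{\Sigma}_{yy} \mathbf{w} = 1 .
\]
The stationarity conditions of the Lagrangian are
\begin{align*}
\mathbf{\Sigma}_{xy} \mathbf{w} &= \rho\, \mathbf{\Sigma}_{xx} \mathbf{v}, &
\mathbf{\Sigma}_{xy}^\top \mathbf{v} &= \rho\, \mathbf{\Sigma}_{yy} \mathbf{w},
\end{align*}
where both multipliers equal the attained correlation $\rho = \mathbf{v}^\top \mathbf{\Sigma}_{xy} \mathbf{w}$. Eliminating $\mathbf{v} = \rho^{-1} \mathbf{\Sigma}_{xx}^{-1} \mathbf{\Sigma}_{xy} \mathbf{w}$ gives
\[
\mathbf{\Sigma}_{xy}^\top \mathbf{\Sigma}_{xx}^{-1} \mathbf{\Sigma}_{xy} \mathbf{w} = \rho^2\, \mathbf{\Sigma}_{yy} \mathbf{w},
\]
so the generalised eigenvalue is the squared partial canonical correlation, $\lambda = \rho^2$, and the leading eigenvector gives the optimal $\mathbf{w}$.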

\subsection{Empirical estimators}\label{sec:Estim_details}

We provide additional details regarding our estimators, focusing on three key aspects: the estimation of conditional expectations, the extraction of multiple components, and the stability of the solutions obtained through the Generalised Eigenvalue (GEV) problem.

\paragraph{Estimation of the Conditional Expectation} 
We have, so far, assumed the availability of unbiased estimators for the conditional expectations $\mathbb{E}[Y|X, Z]$ and $\mathbb{E}[Y|Z]$. In practice, these estimators should be selected based on domain-specific knowledge. In our case, we use the OLS estimator for the linear case and random forests for the nonlinear case.

\paragraph{Further Components}\label{sec:more_components}
Until now, we have considered the case where $ q = 1 $, assuming that the direct effect of $ X $ on $ Y $ has rank one. In a manner analogous to the power iteration method \citep{Mises1929}, we can extract additional components by employing a deflation technique.
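
The deflation step is not detailed above. One standard option, given here as an illustrative assumption rather than as the exact procedure, is Hotelling-type deflation for a symmetric generalised eigenvalue problem $\mathbf{M}\mathbf{w} = \lambda \mathbf{N}\mathbf{w}$, where $\mathbf{M}$ and $\mathbf{N}$ stand for the numerator and denominator matrices of the chosen loss. With the eigenvectors normalised so that $\mathbf{w}_k^\top \mathbf{N} \mathbf{w}_k = 1$, set $\mathbf{M}_1 = \mathbf{M}$ and
\[
\mathbf{M}_{k+1} = \mathbf{M}_k - \lambda_k\, \mathbf{N} \mathbf{w}_k \mathbf{w}_k^\top \mathbf{N}.
\]
Then $\mathbf{M}_{k+1}\mathbf{w}_k = \mathbf{M}_k\mathbf{w}_k - \lambda_k \mathbf{N}\mathbf{w}_k = 0$, while any other eigenvector $\mathbf{w}_j$, which can be chosen to satisfy $\mathbf{w}_k^\top \mathbf{N} \mathbf{w}_j = 0$, keeps its eigenvalue: $\mathbf{M}_{k+1}\mathbf{w}_j = \mathbf{M}_k \mathbf{w}_j = \lambda_j \mathbf{N}\mathbf{w}_j$. The leading eigenvector of the deflated problem is therefore the next component.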


\paragraph{Stability of the Solution}\label{sec:stability}
The stability of the solutions is influenced by the covariance matrices $ \hat{\boldsymbol{\Sigma}}_{\text{full}} $ and $ \hat{\boldsymbol{\Sigma}}_{R_y} $, which may be ill-conditioned due to the characteristics of the noise term $ N_Y $. This can make the GEV problem numerically unstable. To mitigate this issue, we regularise the covariance matrices by replacing them with $\hat{\boldsymbol{\Sigma}}_{\text{full}} + \delta \mathbf{I}$ and $\hat{\boldsymbol{\Sigma}}_{R_y} + \delta \mathbf{I}$, where $ \delta $ is a small constant (typically $ 10^{-8} $) that bounds the smallest eigenvalues away from zero.

A more careful choice of the regularisation parameter may be crucial for high-dimensional response variables. A promising approach is the Ledoit-Wolf regularisation strategy proposed by \citet{ledoit2004well}.
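
For concreteness, shrinkage estimators of this type replace the sample covariance $\hat{\boldsymbol{\Sigma}}$ by a convex combination with a scaled identity. The notation $\alpha$ for the shrinkage intensity is introduced here for illustration:
\[
\hat{\boldsymbol{\Sigma}}_{\alpha} = (1 - \alpha)\, \hat{\boldsymbol{\Sigma}} + \alpha\, \frac{\operatorname{tr}(\hat{\boldsymbol{\Sigma}})}{d}\, \mathbf{I}, \qquad \alpha \in [0, 1].
\]
Every eigenvalue $\hat{\lambda}$ of $\hat{\boldsymbol{\Sigma}}$ is mapped to $(1-\alpha)\hat{\lambda} + \alpha \operatorname{tr}(\hat{\boldsymbol{\Sigma}})/d$, so the smallest eigenvalue is bounded below by $\alpha \operatorname{tr}(\hat{\boldsymbol{\Sigma}})/d$ and the condition number cannot increase. The Ledoit-Wolf approach selects $\alpha$ from the data, in contrast with a fixed additive constant $\delta$.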


\paragraph{Complexity}\label{sec:complexity}

The computational complexity of the Direct Effect Analysis algorithm scales with the number of components ($K$), the number of samples ($n$), and the dimensionality of the variables ($d$). Let the training complexities of the conditional expectation models $g_\text{res}$ and $g_\text{full}$ be $\mathcal{O}(w_\text{res}(n, d, r))$ and $\mathcal{O}(w_\text{full}(n, d, r))$, respectively. Then, the overall complexity is
\begin{align*}
    \mathcal{O}\left(K \left(w_\text{res}(n, d, r) + w_\text{full}(n, d, r) + nd^2 + Kd^3\right)\right).
\end{align*}

Here, the first two terms account for training the conditional expectation estimators, the third term corresponds to computing the residual covariance matrix, and the final term arises from solving the GEV problem.





\subsection{Noise term behavior}\label{sec:noise_term_behavior}

The conditions outlined in Proposition \ref{prop:ntb} may initially seem complex, so we provide a more intuitive explanation here. In many practical scenarios, improving the signal-to-noise ratio becomes crucial as the dimensionality of $Y$ increases, which can occur when enhancing image resolution or adding sensors. Higher-dimensional data provides a richer representation of the system, enabling better separation of signal and noise and improving inference accuracy.

\paragraph{Example.} In climate science, we analyze global temperature patterns using climate observations. Let $Y$ represent the observed temperature field and $N_Y$ represent the observational errors arising from sensor limitations or model imperfections. The function $\phi(X)$ may capture internal climate variability (e.g., El Niño), while $\psi(Z)$ represents external forcing (e.g., greenhouse gas emissions). Increasing data granularity—through higher-resolution climate models, more sensors, or longer historical records—enhances the detectability of systematic climate responses while averaging out transient noise. As a result, the signal-to-noise ratio improves, making it easier to discern causal relationships and understand climate drivers.

The key insight is that algorithm performance depends on the structure of $\mathbf{b}$ and its interaction with the noise terms. Performance improves when the signal variance increases ($\|\mathbf{b}\|^2$ grows with $d$) in directions where the noise covariances $\mathbf{\Sigma}$ and $\mathbf{\Sigma}_{\psi(Z)}$ are small. Consider, for example, the simple case where the eigenvalues of the covariance matrix $\mathbf{\Sigma}_N$ decay quadratically as $d$ increases and where $\mathbf{b}=(1, \dots, 1)$. The estimator $T_D$ is optimal in that it maximises the signal-to-noise ratio under mild conditions (see Proposition \ref{prop:snr_max}). Convergence issues arise only in rare cases where $\mathbf{b}$, $\mathbf{\Sigma}$, and $\mathbf{\Sigma}_{\psi(Z)}$ decay at similar rates, as illustrated in Figure \ref{fig:DR_noise_behavior_noNoise}.
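
To make this example concrete, suppose (purely for illustration) that $\mathbf{\Sigma}_N = \mathbf{\Sigma} + \mathbf{\Sigma}_{\psi(Z)}$ is diagonal with eigenvalues $j^{-2}$, $j = 1, \dots, d$, and that $\mathbf{b} = (1, \dots, 1)$. Then
\begin{align*}
    \|\mathbf{b}\|^2 &= d, \\
    \mathbf{b}^\top \mathbf{\Sigma}_N \mathbf{b} &= \sum_{j=1}^d \frac{1}{j^2} < \frac{\pi^2}{6}, \\
    \mathbf{b}^\top \mathbf{\Sigma}_N^{-1} \mathbf{b} &= \sum_{j=1}^d j^2 = \frac{d(d+1)(2d+1)}{6}.
\end{align*}
The signal energy $\|\mathbf{b}\|^2$ grows linearly with $d$ while the noise energy along $\mathbf{b}$ stays bounded, so their ratio diverges. Likewise, $\mathbf{b}^\top \mathbf{\Sigma}_N^{-1} \mathbf{b}$, which governs the signal-to-noise ratio attained by $\mathbf{w}_D$, grows like $d^3/3$.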

While these guarantees hold in the idealised population setting with infinite data, real-world applications often involve limited samples. In such cases, the theoretical insights may not directly translate into robust performance, necessitating regularisation techniques to prevent overfitting and improve estimation reliability in small datasets.



\section{Proofs}\label{sec:Proofs}

We now prove our main theoretical results. We first establish a result on the leading eigenvector of each optimisation problem, which is used throughout the subsequent proofs.

\subsection{Auxiliary lemma}

\begin{lemma}\label{lemma:EGV_sol}
Let $\mathbf{w}_S$, $\mathbf{w}_F$, and $\mathbf{w}_D$ denote the first eigenvectors associated with the optimisation problems in Eq. \eqref{eq:simple_population_loss}, Eq. \eqref{eq:population_loss_F}, and Eq. \eqref{eq:population_loss_detect}, respectively. The following properties hold:

\begin{enumerate}
    \item The eigenvector $\mathbf{w}_S$ is proportional to $\mathbf{b}$, i.e., $\mathbf{w}_S \propto \mathbf{b}$, when maximising Eq. \eqref{eq:simple_population_loss}. In the linear case, it corresponds to the Gradient EDE.
    \item The eigenvector $\mathbf{w}_F$ is proportional to the true causal effect weighted by the inverse noise covariance, i.e., $\mathbf{w}_F \propto \mathbf{\Sigma}^{-1} \mathbf{b}$, when maximising Eq. \eqref{eq:population_loss_F}.
    \item The eigenvector $\mathbf{w}_D$ is proportional to the true causal effect weighted by the inverse of the sum of the covariances of the noise and confounding variables, i.e., $\mathbf{w}_D \propto (\mathbf{\Sigma} + \mathbf{\Sigma}_{\psi(Z)})^{-1} \mathbf{b}$, when maximising Eq. \eqref{eq:population_loss_detect}.
\end{enumerate}
\end{lemma}


\begin{proof}
    Recall the definitions:
    \begin{align*}
        R_{\text{full}}^2(\mathbf{w}) &= \mathbb{E}[(\mathbf{w}^\top Y - \mathbb{E}[\mathbf{w}^\top Y | X, Z])^2], \\
        R_{\text{res}}^2(\mathbf{w}) &= \mathbb{E}[(\mathbf{w}^\top Y - \mathbb{E}[\mathbf{w}^\top Y | Z])^2], \\
        R_{\text{noise}}^2(\mathbf{w}) &= \mathbb{E}[(\mathbf{w}^\top Y - \mathbb{E}[\mathbf{w}^\top Y \mid X, Z=0])^2].
    \end{align*}

    From the model $Y^x = \mathbf{b}\phi(x) + \psi(Z) + N_y$, we derive:
    \begin{align*}
        R_{\text{full}}^2(\mathbf{w}) &= \mathbf{w}^\top \mathbf{\Sigma} \mathbf{w}, \\
        R_{\text{res}}^2(\mathbf{w}) &= \mathbf{w}^\top \mathbf{\Sigma} \mathbf{w} + \phi(x)^2 \mathbf{w}^\top \mathbf{b} \mathbf{b}^\top \mathbf{w}, \\
        R_{\text{noise}}^2(\mathbf{w}) &= \mathbf{w}^\top \mathbf{\Sigma} \mathbf{w} + \mathbf{w}^\top \mathbf{\Sigma}_{\psi(Z)} \mathbf{w}.
    \end{align*}

    The difference between the residual and full terms is:
    \[
        R_{\text{res}}^2(\mathbf{w}) - R_{\text{full}}^2(\mathbf{w}) = \phi(x)^2 \mathbf{w}^\top \mathbf{b} \mathbf{b}^\top \mathbf{w}.
    \]

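    Each optimisation problem for $T_S$, $T_F$, and $T_D$ corresponds to a generalised eigenvalue problem whose solutions are the eigenvectors of $\mathbf{N}^{-1}\mathbf{M}$, where $\mathbf{M} = \phi(x)^2 \mathbf{b} \mathbf{b}^\top$ is rank-1 and $\mathbf{N}$ equals $\mathbf{I}$, $\mathbf{\Sigma}$, and $\mathbf{\Sigma} + \mathbf{\Sigma}_{\psi(Z)}$, respectively. Since $\mathbf{N}^{-1}\mathbf{M}\mathbf{w} = \phi(x)^2 (\mathbf{b}^\top \mathbf{w})\, \mathbf{N}^{-1}\mathbf{b}$ for any $\mathbf{w}$, the only eigenvector with a non-zero eigenvalue is proportional to $\mathbf{N}^{-1}\mathbf{b}$. Therefore, the first eigenvector $\mathbf{w}_1$ is proportional to $\mathbf{N}^{-1} \mathbf{b}$. The optimal solutions are: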
    \begin{enumerate}
        \item $\mathbf{w}_S \propto \mathbf{b}$ for Eq. \eqref{eq:simple_population_loss},
        \item $\mathbf{w}_F \propto \mathbf{\Sigma}^{-1} \mathbf{b}$ for Eq. \eqref{eq:population_loss_F},
        \item $\mathbf{w}_D \propto (\mathbf{\Sigma} + \mathbf{\Sigma}_{\psi(Z)})^{-1} \mathbf{b}$ for Eq. \eqref{eq:population_loss_detect}.
    \end{enumerate}
\end{proof}

\subsection{Signal-to-Noise optimality }



\begin{proposition}[General optimality]
    Assuming $P$ is entailed in the SCM in Eq. \eqref{eq:scm}, we have that $\mathbf{w}_D$ is optimal, in the sense that it maximises the signal-to-noise ratio $\gamma^2(\mathbf{w})$.
\end{proposition}
\begin{proof}
    Recall that the optimal detector loss is defined as:
    \begin{align*}
        T_D &= \frac{R_{\text{res}}^2(\mathbf{w}) - R_{\text{full}}^2(\mathbf{w})}{R_{\text{noise}}^2(\mathbf{w})} = \frac{\phi(x)^2 \mathbf{w}^\top \mathbf{b} \mathbf{b}^\top \mathbf{w}}{\mathbf{w}^\top \mathbf{\Sigma} \mathbf{w} + \mathbf{w}^\top \mathbf{\Sigma}_{\psi(Z)} \mathbf{w}}.
    \end{align*}

    Using the results from \autoref{lemma:EGV_sol}, the signal-to-noise ratio is given by:
    \begin{align*}
        \gamma^2(\mathbf{w}) &= \frac{(\mathbf{w}^\top S(x))^2}{\mathbf{w}^\top \mathbf{\Sigma}_N \mathbf{w}} = \phi(x)^2 \frac{\mathbf{w}^\top \mathbf{b} \mathbf{b}^\top \mathbf{w}}{\mathbf{w}^\top (\mathbf{\Sigma} + \mathbf{\Sigma}_{\psi(Z)}) \mathbf{w}},
    \end{align*}
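    where we note that $\psi(Z)$ and $N_y$ are assumed to be independent, so that $\mathbf{\Sigma}_N = \mathbf{\Sigma} + \mathbf{\Sigma}_{\psi(Z)}$.

    Hence $T_D(\mathbf{w}) = \gamma^2(\mathbf{w})$ for every $\mathbf{w}$, and the maximiser $\mathbf{w}_D$ of $T_D$ also maximises the signal-to-noise ratio. This completes the proof.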
\end{proof}
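
The maximiser can also be checked directly. Writing $\mathbf{\Sigma}_N = \mathbf{\Sigma} + \mathbf{\Sigma}_{\psi(Z)}$ and assuming it is positive definite, the Cauchy--Schwarz inequality gives, for any $\mathbf{w} \neq 0$,
\[
(\mathbf{w}^\top \mathbf{b})^2 = \left( (\mathbf{\Sigma}_N^{1/2}\mathbf{w})^\top (\mathbf{\Sigma}_N^{-1/2}\mathbf{b}) \right)^2 \leq (\mathbf{w}^\top \mathbf{\Sigma}_N \mathbf{w})\, (\mathbf{b}^\top \mathbf{\Sigma}_N^{-1} \mathbf{b}),
\]
so that $\gamma^2(\mathbf{w}) \leq \phi(x)^2\, \mathbf{b}^\top \mathbf{\Sigma}_N^{-1} \mathbf{b}$, with equality if and only if $\mathbf{w} \propto \mathbf{\Sigma}_N^{-1}\mathbf{b}$.

As a small numerical illustration (the values are hypothetical), take $d = 2$, $\mathbf{b} = (1, 1)^\top$, $\mathbf{\Sigma}_N = \mathrm{diag}(1, 4)$ and $\phi(x) = 1$. The direction $\mathbf{w} = \mathbf{b}$ gives $\gamma^2 = (1+1)^2 / (1 + 4) = 0.8$, whereas $\mathbf{w} = \mathbf{\Sigma}_N^{-1}\mathbf{b} = (1, 1/4)^\top$ gives $\gamma^2 = (1 + 1/4)^2 / (1 + 4/16) = 1.25 = \mathbf{b}^\top \mathbf{\Sigma}_N^{-1} \mathbf{b}$. Down-weighting the noisier coordinate increases the signal-to-noise ratio.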

\begin{proposition}[Optimality under isotropic noise]
    Assuming that $P$ is entailed in the SCM in Eq. \eqref{eq:scm} and that $\mathbf{\Sigma}_N$ is isotropic, we have that both $\mathbf{w}_S$ and $\mathbf{w}_D$ are optimal. Moreover, if $\mathbf{\Sigma}_y$ is also isotropic, then $\mathbf{w}_F$ is also optimal.
\end{proposition}

\begin{proof}
    The proof follows from the assumption that $\mathbf{\Sigma}_N = \mathbf{\Sigma}_{\psi(Z)} + \mathbf{\Sigma}$ is isotropic. Under this assumption, the constraint $\mathbf{w}^\top (\mathbf{\Sigma}_{\psi(Z)} + \mathbf{\Sigma}) \mathbf{w} = 1$ is equivalent, up to a rescaling of $\mathbf{w}$, to the constraint $\|\mathbf{w}\| = 1$. Therefore, the loss function $T_D$ simplifies to the form of $T_S$. As a consequence, the optimality of $\mathbf{w}_D$ stated in Prop. \ref{prop:snr_max} implies the optimality of $\mathbf{w}_S$.

    Similarly, when both $\mathbf{\Sigma}_N$ and $\mathbf{\Sigma}_y$ are isotropic, we observe that $T_F$ becomes equivalent to $T_S$. Since it has been established that if $\mathbf{\Sigma}_N$ is isotropic, $\mathbf{w}_S$ is optimal, we conclude that $\mathbf{w}_F$ is also optimal.
\end{proof}


\subsection{Noise term behavior}

\begin{proposition}[Noise Term Behavior]
    Let $\|\mathbf{b}\|^2 = o\left(\nu_1(d)\right)$, $\mathbf{b}^\top (\boldsymbol{\Sigma} + \boldsymbol{\Sigma}_{\psi(Z)}) \mathbf{b} = o\left(\nu_2(d)\right)$, $\mathbf{b}^\top \boldsymbol{\Sigma}^{-1} \mathbf{b} = o\left(\nu_3(d)\right)$, $\mathbf{b}^\top (\boldsymbol{\Sigma}^{-1} + \boldsymbol{\Sigma}^{-1} \boldsymbol{\Sigma}_{\psi(Z)} \boldsymbol{\Sigma}^{-1}) \mathbf{b} = o\left(\nu_4(d)\right)$, and $\mathbf{b}^\top (\boldsymbol{\Sigma} + \boldsymbol{\Sigma}_{\psi(Z)})^{-1} \mathbf{b} = o\left(\nu_5(d)\right)$.

    Assume that $\phi(X) \in \mathbb{R}$, the distribution $P$ follows the structural causal model in Eq. \eqref{eq:scm}, and the following conditions hold:
    \begin{enumerate}
        \item $\lim_{d \to \infty} \frac{\nu_1(d)}{\nu_2(d)} = \infty$,

        \item $\lim_{d \to \infty} \frac{\nu_3^2(d)}{\nu_4(d)} = \infty$,

        \item $\lim_{d \to \infty} \nu_5(d) = \infty$.
    \end{enumerate}

    Under these conditions, the following convergence properties hold:
    \begin{enumerate}
        \item $\gamma^2(\mathbf{w}_S) \to \infty$ if condition 1) holds,

        \item $\gamma^2(\mathbf{w}_F) \to \infty$ if condition 2) holds,

        \item $\gamma^2(\mathbf{w}_D) \to \infty$ if condition 1), 2), or 3) holds.
    \end{enumerate}
\end{proposition}


\begin{proof}
    Recall that
    \begin{align}
        R_{\text{full}}^2(\mathbf{w}) &= \mathbf{w}^\top \mathbf{\Sigma}\mathbf{w},\\
        R_{\text{res}}^2(\mathbf{w}) - R_{\text{full}}^2(\mathbf{w}) &= \phi(x)^2  \mathbf{w}^\top \mathbf{b}\mathbf{b}^\top  \mathbf{w},\\
        R^2_{\text{noise}}(\mathbf{w}) &= \mathbf{w}^\top \mathbf{\Sigma}\mathbf{w} + \mathbf{w}^\top \mathbf{\Sigma}_{\psi(Z)} \mathbf{w}.
    \end{align}

    

    Substituting these into the signal-to-noise ratio in Eq. \eqref{eq:SNR}, we obtain:
    \begin{align}
        \gamma^2(\mathbf{w}_S) &= \frac{\|\mathbf{b}\|^4 \phi(x)^2}{\mathbf{b}^\top\mathbf{\Sigma}\mathbf{b} + \mathbf{b}^\top\mathbf{\Sigma}_{\psi(Z)}\mathbf{b}}, \\
        \gamma^2(\mathbf{w}_F) &= \frac{(\mathbf{b}^\top\mathbf{\Sigma}^{-1}\mathbf{b})^2 \phi(x)^2}{\mathbf{b}^\top\mathbf{\Sigma}^{-1}\mathbf{b} + \mathbf{b}^\top \mathbf{\Sigma}^{-1}\mathbf{\Sigma}_{\psi(Z)} \mathbf{\Sigma}^{-1}\mathbf{b}}, \\
        \gamma^2(\mathbf{w}_D) &= \mathbf{b}^\top(\mathbf{\Sigma} + \mathbf{\Sigma}_{\psi(Z)})^{-1}\mathbf{b} \phi(x)^2.
    \end{align}

    The convergence properties of the signal-to-noise ratio follow directly from these formulations, assuming $\phi(x)$ is bounded.
    Since \( \mathbf{w}_D \) maximises the SNR, if either \( \mathbf{w}_S \) or \( \mathbf{w}_F \) has an SNR that grows to infinity, then the SNR of \( \mathbf{w}_D \) will also tend to infinity.
\end{proof}

\subsection{Equivalence of Signal-to-Noise ratio and Fisher information}\label{subsec:FI_SNR}

\begin{proposition}[Equivalence between Fisher Information and SNR]
    Consider an SCM as described in \eqref{eq:scm}, and let the intervention function be $\phi(x) = \mathbf{v}^\top x$, where $\mathbf{v} \in \mathbb{R}^d$. Then, the SNR is proportional to the Fisher Information of the intervention, i.e., $I_{\mathbf{w}}(x) = \alpha \gamma^2(\mathbf{w})$ with $\alpha \in \mathbb{R}^+$.
\end{proposition}

\begin{proof}
    Let $\mathbf{w}^\top Y^x \sim \mathcal{N}(\mathbf{w}^\top \mu(x), \mathbf{w}^\top \mathbf{\Sigma}_N \mathbf{w})$ denote the distribution of $\mathbf{w}^\top Y^x$.

    The log-likelihood for the intervention is given by:
    \begin{align*}
        \log p(\mathbf{w}^\top Y \mid do(X=x)) &= C - \frac{1}{2} \mathbf{w}^\top (Y - \mu(x)) (\mathbf{w}^\top \mathbf{\Sigma}_N \mathbf{w})^{-1} (Y - \mu(x))^\top \mathbf{w},
    \end{align*}
    where $C$ is constant with respect to $x$ and $\mathbf{\Sigma}_N = \mathbf{\Sigma}_{\psi(z)} + \mathbf{\Sigma}$.

    The \textit{informant} $U(x)$ is the derivative of the log-likelihood with respect to $x$, written here as a row vector:
    \begin{align*}
        U(x) = \frac{\partial}{\partial x} \log p(\mathbf{w}^\top Y \mid do(X=x)) &= \mathbf{w}^\top \mu'(x) (\mathbf{w}^\top \mathbf{\Sigma}_N \mathbf{w})^{-1} (Y - \mu(x))^\top \mathbf{w},
    \end{align*}
    where $\mu'(x)$ denotes the Jacobian of $\mu$ at $x$; the factor $\frac{1}{2}$ cancels when differentiating the quadratic form.

    The Fisher information $I_\mathbf{w}(x)$ is the variance of the informant. Since the informant has mean zero under $p(\cdot \mid do(X=x))$ \citep[see][section 6]{lehmann2006theory}, we write:
    \begin{align*}
        I_\mathbf{w}(x) &= \mathbb{E}[U(x) U(x)^\top] \\
        &= \mathbb{E} \left[ \mathbf{w}^\top \mu'(x) (\mathbf{w}^\top \mathbf{\Sigma}_N \mathbf{w})^{-1} (Y - \mu(x))^\top \mathbf{w} \mathbf{w}^\top (Y - \mu(x)) (\mathbf{w}^\top \mathbf{\Sigma}_N \mathbf{w})^{-1} \mu'(x)^\top \mathbf{w} \right].
    \end{align*}
    Using the fact that $\mathbb{E}[(Y - \mu(x))(Y - \mu(x))^\top] = \mathbf{\Sigma}_N$, we obtain:
    \begin{align*}
        I_\mathbf{w}(x) &= \mathbf{w}^\top \mu'(x) (\mathbf{w}^\top \mathbf{\Sigma}_N \mathbf{w})^{-1} \mu'(x)^\top \mathbf{w}.
    \end{align*}

    Since $\mu(x) = \mathbf{b} \mathbf{v}^\top x$, we have $\mu'(x) = \mathbf{b} \mathbf{v}^\top$. Substituting this expression and $\mathbf{\Sigma}_N = \mathbf{\Sigma}_{\psi(z)} + \mathbf{\Sigma}$ into the Fisher information formula, we get:
    \begin{align*}
        I_\mathbf{w}(x) &= \mathbf{w}^\top \mathbf{b} \mathbf{v}^\top (\mathbf{w}^\top (\mathbf{\Sigma}_{\psi(z)} + \mathbf{\Sigma}) \mathbf{w})^{-1} \mathbf{v} \mathbf{b}^\top \mathbf{w} \\
        &= \frac{\mathbf{w}^\top \mathbf{b} \mathbf{v}^\top \mathbf{v} \mathbf{b}^\top \mathbf{w}}{\mathbf{w}^\top (\mathbf{\Sigma}_{\psi(z)} + \mathbf{\Sigma}) \mathbf{w}} \\
        &= \frac{\|\mathbf{v}\|_2^2}{\phi(x)^2} \gamma^2(\mathbf{w}).
    \end{align*}
\end{proof}


\begin{proposition}
Let \( P(Y \mid x) \) be a probability distribution over \( Y \) parameterised by \( x \in \mathbb{R}^d \). Consider a small perturbation \( \delta x \) such that \( P(Y \mid x + \delta x) \) remains close to \( P(Y \mid x) \). Then, the Kullback–Leibler divergence between these two distributions admits the following second-order expansion:
\[
D_{\mathrm{KL}}(P(Y \mid x) \,\|\, P(Y \mid x + \delta x))
= \frac{1}{2} \delta x^\top I(x) \delta x + O(\|\delta x\|^3),
\]
where \( I(x) \) is the \emph{Fisher information matrix}. For the projected response \( \mathbf{w}^\top Y \), it is given by
\begin{align*}
    I_{\mathbf{w}}(x) = \mathbb{E} \left[ U(x) U(x)^\top \right],
\end{align*}
with $U(x) =  \nabla_x \log P(\mathbf{w}^\top Y \mid  X= x)$ denoting the informant (or score) function.
\end{proposition}




\begin{proof}[Proof sketch]
% 1. \textbf{First-Order Expansion of \( P(Y \mid x + \delta x) \):}  
   Assuming \( P(Y \mid x) \) is smooth in \( x \), we expand \( \log P(Y \mid x + \delta x) \) to second order around \( x \):
   \[
   \log P(Y \mid x + \delta x) = \log P(Y \mid x) + \delta x^\top \nabla_x \log P(Y \mid x) + \frac{1}{2} \delta x^\top \nabla_x^2 \log P(Y \mid x) \delta x + O(\|\delta x\|^3).
   \]

% 2. \textbf{Taylor Expansion of the KL Divergence:}  
   The KL divergence is defined as:
   \[
   D_{\mathrm{KL}}(P(Y \mid x) \,\|\, P(Y \mid x + \delta x))
   = \mathbb{E}_{Y \sim P(Y \mid x)} \left[ \log \frac{P(Y \mid x)}{P(Y \mid x + \delta x)} \right].
   \]

% 3. \textbf{Computing the Expectation:}  
   Substituting into the KL divergence and using the property that \( \mathbb{E}_{Y \sim P(Y \mid x)} [\nabla_x \log P(Y \mid x)] = 0 \), the first-order term vanishes \citep[see][section 6]{lehmann2006theory}, leaving:
   \[
   D_{\mathrm{KL}}(P(Y \mid x) \,\|\, P(Y \mid x + \delta x))
   = -\frac{1}{2} \mathbb{E}_{Y \sim P(Y \mid x)} \left[ \delta x^\top \nabla_x^2 \log P(Y \mid x) \delta x \right] + O(\|\delta x\|^3).
   \]
   Since, under standard regularity conditions, the Fisher information matrix satisfies \( I(x) = \mathbb{E}[U(x) U(x)^\top] = -\mathbb{E}[\nabla_x^2 \log P(Y \mid x)] \), we obtain:
   \[
   D_{\mathrm{KL}}(P(Y \mid x) \,\|\, P(Y \mid x + \delta x))
   = \frac{1}{2} \delta x^\top I(x) \delta x + O(\|\delta x\|^3).
   \]
\end{proof}


\subsection{Distribution of leading eigenvalues under conditional independence hypothesis}

% 
\begin{proposition}[Distribution of $\lambda_F$ under conditional independence]
    Let the distribution $P$ be induced by the SCM in \eqref{eq:scm} with linear assignments and Gaussian noise, and assume $p = q = 1$. Under the null hypothesis $H_0: X \indep Y \mid Z$, the largest root $\lambda_F$ is $F$-distributed up to scaling: $(dfd/dfn)\lambda_F\sim F(dfn, dfd)$, where $dfn = d$ and $dfd=n-p-r-1$.
\end{proposition} 

\begin{proof}[Proof sketch]
    It can be shown that $\hat{R}^2_{\text{full}}$ and $\hat{R}_{\text{res}}^2$ follow (scaled) $\chi^2$ distributions with $d(n - p - r -1)$ and $d(n - r -1)$ degrees of freedom, respectively, as they are computed as sums of squared Gaussian variables. The ratio of $\hat{R}_{\text{res}}^2 - \hat{R}^2_{\text{full}}$ to $\hat{R}^2_{\text{full}}$, rescaled by the degrees of freedom, can thus be shown to follow an $F$ distribution with degrees of freedom $dfn=d$ and $dfd=n- p-r-1$.
    As the weights related to $Z$ are frozen when computing $\hat{R}^2_{\text{noise}}$, it follows a $\chi^2$ distribution with $d(n  - p - 1)$ degrees of freedom. Thus $\Lambda_D \sim F(p, n-p-1)$.
\end{proof}

We refer the reader to the distribution of Roy's largest root, the Chow test \citep{chow1960tests}, or the generalised linear hypothesis test \citep[see e.g.][chapter 7]{anderson1958}, as similar problems have been widely studied in the multivariate statistics literature \citep{anderson2003introduction, Bilodeau1999TheoryOM}.

\begin{proposition}[Upper Bound on $\Lambda_D$ Under Conditional Independence]  
    Under assumptions similar to those of Prop. \ref{prop:lambda_F_distrib}, we have, under the null hypothesis $H_0: X \indep Y \mid Z$, that $P(\Lambda_D \geq \lambda_D | H_0) \leq P(\Lambda_F\geq\lambda_D |H_0)$.
\end{proposition} 

\begin{proof}[Proof sketch]
Since $R^2_{\text{noise}}$ retains the variance attributable to $Z$ whereas $R^2_{\text{full}}$ removes it, the empirical residuals $R^2_{\text{noise}}$ are at least as large as $R^2_{\text{full}}$. As the numerator $R^2_{\text{res}} - R^2_{\text{full}}$ is non-negative, we have that
\begin{align*}
    \frac{R^2_{\text{res}} - R^2_{\text{full}}}{R^2_{\text{noise}}} \leq  \frac{R^2_{\text{res}} - R^2_{\text{full}}}{R^2_{\text{full}}}.
\end{align*}
Hence, we have for any $\lambda_D$ that $P(\Lambda_D \geq \lambda_D | H_0) \leq P(\Lambda_F \geq \lambda_D|H_0)$ under the null hypothesis $H_0: X\indep Y |Z$.
\end{proof}

Note that using this upper bound costs some power, but the type I error remains controlled (we reject less often than an exact test would), so the test is valid. Further research should aim at finding a better approximation of the distribution of $\Lambda_D$.

\subsection{Convergence rates}
We first introduce an important theorem that will be useful for the proof.

\begin{theorem}[Davis-Kahan theorem \citep{davis1970rotation}]
Let $ \lambda^{(1)} > \lambda^{(2)} \geq \dots \geq \lambda^{(d)} $ be the eigenvalues of $ \mathbf{\Sigma} $, with $ \lambda^{(1)} - \lambda^{(2)} = \delta > 0 $, and let $ \hat{\lambda}^{(1)} > \hat{\lambda}^{(2)} \geq \dots \geq \hat{\lambda}^{(d)} $ be the eigenvalues of $ \mathbf{\hat{\Sigma}} $, with $ \hat{\lambda}^{(1)} - \hat{\lambda}^{(2)} = \hat{\delta} > 0 $. Let $\mathbf{W}$ and $\mathbf{\hat{W}}$ denote their corresponding eigenvectors. We have that
\begin{equation}
    \|\sin \Theta(\mathbf{W}, \mathbf{\hat{W}})\|_{op} \leq \frac{\|\mathbf{\Sigma} - \mathbf{\hat{\Sigma}}\|_{op}}{\min_j(|\hat{\lambda}^{(j-1)} - \lambda^{(j)}|, |\hat{\lambda}^{(j+1)} - \lambda^{(j)}|)},
\end{equation}
where $\Theta(\mathbf{W}, \mathbf{\hat{W}})$ collects the principal angles between the corresponding eigenspaces. Moreover, for any $j$ and a suitable choice of sign of $\mathbf{\hat{w}}_j$, we have that $\|\mathbf{\hat{w}}_j - \mathbf{w}_j\| \leq \sqrt{2}\sin \Theta(\mathbf{w}_j, \mathbf{\hat{w}}_j)$.
    
\end{theorem}



We show that under common assumptions, specifically that there are two unbiased estimators $ \hat{g}_{\text{full}} $ and $ \hat{g}_{\text{res}} $ with convergence rates $ \kappa_1(n) $ and $ \kappa_2(n) $, the estimators proposed in Eq. \eqref{eq:simple_population_loss} are consistent with their population counterparts. Furthermore, we demonstrate that their convergence rate typically depends on the convergence rates $ \kappa_1(n) $ and $ \kappa_2(n) $.

\begin{proposition}[Convergence Rate of F-Test Based Losses]\label{prop:cv}
    Assume the following conditions hold:
 \begin{enumerate}
        \item $\mathbb{E} \left\| \hat{g}_{\text{full}}(\mathbf{X}_i, \mathbf{Z}_i) - \mathbb{E}[\mathbf{Y}_i \mid \mathbf{X}_i, \mathbf{Z}_i] \right\|^2 = o_P(\kappa_1(n))$,
        \item $\mathbb{E} \left\| \hat{g}_{\text{res}}(\mathbf{Z}_i) - \mathbb{E}[\mathbf{Y}_i \mid \mathbf{Z}_i] \right\|^2 = o_P(\kappa_2(n))$,
        \item $\lambda^{M}_{1} - \lambda^{M}_{2} = \delta_M > 0$, where $\lambda^{M}_{1} > \lambda^{M}_{2} \geq \dots \geq \lambda^{M}_{d}$ are the eigenvalues of $\mathbf{M}$,
        \item $\lambda^{N}_{1} - \lambda^{N}_{2} = \delta_N > 0$, where $\lambda^{N}_{1} > \lambda^{N}_{2} \geq \dots \geq \lambda^{N}_{d}$ are the eigenvalues of $\mathbf{N}$,
        \item $\mathbb{E} \left\| Y - \mathbb{E}[Y \mid X, Z] \right\|^2 \leq N_{\text{full}}$ and $\mathbb{E} \left\| Y - \mathbb{E}[Y \mid Z] \right\|^2 \leq N_{\text{res}}$.
    \end{enumerate}
    


    Let $\mathbf{w}_1$ be the optimal solution to Eq. \eqref{eq:simple_population_loss}, Eq. \eqref{eq:population_loss_F}, or Eq. \eqref{eq:population_loss_detect}, and let $\hat{\mathbf{w}}$ be the empirical solution to their respective empirical estimators. Under the given conditions, we have the following convergence result:
    \begin{align}
        \mathbb{E} \left[\|\mathbf{w}_1 - \hat{\mathbf{w}}\|_2^2 \right] = o \left( \sqrt{\kappa_1(n)} + \sqrt{\kappa_2(n)} \right).
    \end{align}
\end{proposition}


\begin{proof}
    Similar to what was done for the empirical estimators, the population loss can be written as an eigenvalue decomposition problem $\mathbf{N}^{-1}\mathbf{M}$, where $\mathbf{M} = \mathbf{\Sigma}_{\text{res}} - \mathbf{\Sigma}_{\text{full}}$, and $\mathbf{N}$ depends on the loss used. For simplicity, we consider $\mathbf{N} = \mathbf{I}$, which leads to the convergence result for the simple loss in Eq. \eqref{eq:simple_population_loss}. A similar reasoning can be applied to the convergence of the two other losses.

    Let us first decompose $\mathbf{\hat{\Sigma}_{\text{res}}}$ as follows:
    \begin{align}
        \mathbf{\hat{\Sigma}_{\text{res}}} &= \frac{1}{n}\sum_{i=1}^n (\mathbf{Y_i} - \hat{g}_{res}(\mathbf{Z}_i))(\mathbf{Y_i} - \hat{g}_{res}(\mathbf{Z}_i))^\top\\
        &= \frac{1}{n}\sum_{i=1}^n \left(\mathbf{Y_i} - g_{res}(\mathbf{Z}_i) + g_{res}(\mathbf{Z}_i) - \hat{g}_{res}(\mathbf{Z}_i)\right)\left(\mathbf{Y_i} - g_{res}(\mathbf{Z}_i) + g_{res}(\mathbf{Z}_i) - \hat{g}_{res}(\mathbf{Z}_i)\right)^\top\\
        &= \frac{1}{n}\sum_{i=1}^n \mathbf{N}_{y,i}^\top \mathbf{N}_{y,i} + \frac{2}{n}\sum_{i=1}^n \mathbf{N}_{y,i}^\top \left(g_{res}(\mathbf{Z}_i) - \hat{g}_{res}(\mathbf{Z}_i)\right) + \frac{1}{n}\sum_{i=1}^n \left(g_{res}(\mathbf{Z}_i) - \hat{g}_{res}(\mathbf{Z}_i)\right)^\top \left(g_{res}(\mathbf{Z}_i) - \hat{g}_{res}(\mathbf{Z}_i)\right)
    \end{align}
    where $\mathbf{N}_{y,i}$ is the population residual (noise) of sample $i$.

    We now aim to bound $\|\mathbf{\Sigma} - \mathbf{\hat{\Sigma}}\|_F$, where $\mathbf{\Sigma}$ and $\mathbf{\hat{\Sigma}}$ are shorthand for $\mathbf{\Sigma}_{\text{res}}$ and $\mathbf{\hat{\Sigma}}_{\text{res}}$. Using the previous decomposition and the triangle inequality, we have:
    \begin{align}
        \|\mathbf{\Sigma} - \mathbf{\hat{\Sigma}}\|_F &\leq \|\mathbf{\Sigma} - \frac{1}{n}\sum_{i=1}^n \mathbf{N}_{y,i}^\top \mathbf{N}_{y,i}\|_F + \frac{2}{n}\sum_{i=1}^n \|\mathbf{N}_{y,i}^\top (g_{res}(\mathbf{Z}_i) - \hat{g}_{res}(\mathbf{Z}_i))\|_F \\
        &+ \|\frac{1}{n}\sum_{i=1}^n (g_{res}(\mathbf{Z}_i) - \hat{g}_{res}(\mathbf{Z}_i))^\top (g_{res}(\mathbf{Z}_i) - \hat{g}_{res}(\mathbf{Z}_i))\|_F\\
        &=: A + B + C.
    \end{align}

    We first handle the term $C$:
    \begin{align}
        \mathbb{E}[C] &\leq \frac{1}{n}\sum_{i=1}^n \mathbb{E}\left[\|(g_{res}(\mathbf{Z}_i) - \hat{g}_{res}(\mathbf{Z}_i)) (g_{res}(\mathbf{Z}_i) - \hat{g}_{res}(\mathbf{Z}_i))^\top\|_F\right]\\
        &= \frac{1}{n}\sum_{i=1}^n \mathbb{E}\left[\|g_{res}(\mathbf{Z}_i) - \hat{g}_{res}(\mathbf{Z}_i)\|_2^2\right]\\
        &\leq C_1 \kappa_1(n) \hspace{1cm}\text{by assumption $(1)$}.
    \end{align}
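
    The passage from the first to the second line of this display rests on an elementary identity, recalled here for completeness. The symbol $\mathbf{v}$ is shorthand introduced only for this remark and stands for $g_{res}(\mathbf{Z}_i) - \hat{g}_{res}(\mathbf{Z}_i)$:
    \[
    \|\mathbf{v}\mathbf{v}^\top\|_F^2 = \mathrm{tr}\left(\mathbf{v}\mathbf{v}^\top\mathbf{v}\mathbf{v}^\top\right) = \left(\mathbf{v}^\top\mathbf{v}\right)^2 = \|\mathbf{v}\|_2^4,
    \qquad \text{hence} \qquad
    \|\mathbf{v}\mathbf{v}^\top\|_F = \|\mathbf{v}\|_2^2.
    \]
    The same value is obtained for the scalar $\mathbf{v}^\top\mathbf{v}$, so the bound does not depend on which of the two products is written.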

    Next, for the term $B$:
    \begin{align}
        \mathbb{E}[B] &= \frac{2}{n}\sum_{i=1}^n \mathbb{E}\left[ \|\mathbf{N}_{y,i} (g_{res}(\mathbf{Z}_i) - \hat{g}_{res}(\mathbf{Z}_i))^\top\|_F\right]\\
        &\leq \frac{2}{n}\sum_{i=1}^n \mathbb{E}\left[\|\mathbf{N}_{y,i}\|_2\right]\mathbb{E}\left[\|g_{res}(\mathbf{Z}_i) - \hat{g}_{res}(\mathbf{Z}_i)\|_2\right]\\
        &\leq 2 N_{res} \sqrt{C_1} \sqrt{\kappa_1(n)}\hspace{1cm}\text{by assumptions $(1)$ and $(5)$}.
    \end{align}
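
    The factor $\sqrt{\kappa_1(n)}$ in the last line can be traced to Jensen's inequality. The short check below assumes, as the treatment of the term $C$ suggests, that assumption $(1)$ bounds the mean squared estimation error by $C_1 \kappa_1(n)$:
    \[
    \mathbb{E}\left[\|g_{res}(\mathbf{Z}_i) - \hat{g}_{res}(\mathbf{Z}_i)\|_2\right]
    \leq \sqrt{\mathbb{E}\left[\|g_{res}(\mathbf{Z}_i) - \hat{g}_{res}(\mathbf{Z}_i)\|_2^2\right]}
    \leq \sqrt{C_1 \kappa_1(n)} = \sqrt{C_1}\sqrt{\kappa_1(n)}.
    \]
    This is why the cross term decays at the slower rate $\sqrt{\kappa_1(n)}$, whereas the quadratic term $C$ decays at rate $\kappa_1(n)$.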

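    For the term $A$, which is the deviation of an empirical average of i.i.d. terms from its mean $\mathbf{\Sigma}_{\text{res}}$, a standard variance bound (assuming that the noise has finite fourth moments) yields a constant $C_3$ such that:
    \[
    \mathbb{E}[A] \leq \frac{C_3}{\sqrt{n}}.
    \]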
    Therefore, we obtain the bound:
    \[
    \mathbb{E}\left[\|\mathbf{\Sigma}_{\text{res}} - \mathbf{\hat{\Sigma}}_{\text{res}}\|_F\right] \leq 2 N_{res} \sqrt{C_1} \sqrt{\kappa_1(n)} + C_1 \kappa_1(n) + \frac{C_3}{\sqrt{n}}.
    \]
    Similarly, an equivalent reasoning for the full model gives:
    \[
    \mathbb{E}\left[\|\mathbf{\Sigma}_{\text{full}} - \mathbf{\hat{\Sigma}}_{\text{full}}\|_F\right] \leq 2 N_{full} \sqrt{C_2} \sqrt{\kappa_2(n)} + C_2 \kappa_2(n) + \frac{C_4}{\sqrt{n}} \hspace{1cm}\text{using assumptions $(2)$ and $(5)$}.
    \]

    Finally, applying the Davis--Kahan theorem, we have:
    \begin{align}
        \mathbb{E}\left[\|\mathbf{w}_1 - \hat{\mathbf{w}}\|^2_2\right] &\leq \sqrt{2}\frac{\mathbb{E}\left[\|\mathbf{\Sigma}_{\text{res}} - \mathbf{\hat{\Sigma}}_{\text{res}}\|_F\right] + \mathbb{E}\left[\|\mathbf{\Sigma}_{\text{full}} - \mathbf{\hat{\Sigma}}_{\text{full}}\|_F\right]}{\delta_M}\\
        &\leq \sqrt{2}\frac{2 N_{full} \sqrt{C_2} \sqrt{\kappa_2(n)} + C_2 \kappa_2(n) + \frac{C_4}{\sqrt{n}} + 2 N_{res} \sqrt{C_1} \sqrt{\kappa_1(n)} + C_1 \kappa_1(n) + \frac{C_3}{\sqrt{n}}}{\delta_M}\\
        &= O(\sqrt{\kappa_1(n)} + \sqrt{\kappa_2(n)}),
    \end{align}
    assuming that $\kappa_1(n)$ and $\kappa_2(n)$ decay no faster than $1/n$, so that the $1/\sqrt{n}$ terms are of at most the same order as $\sqrt{\kappa_1(n)} + \sqrt{\kappa_2(n)}$; this is typically the case for most regression algorithms. This concludes the proof.
\end{proof}


\newpage

\section{Experiments}

\subsection{Simulation experiments}\label{eq:data_generation}
The data are generated according to the following SCM:

\begin{align}
\begin{aligned}
    N_x, N_z &\sim \mathcal{N}(0, \mathbf{I}), \\
    N_y &\sim \mathcal{N}(0, \mathbf{\Sigma}), \\
    Z &:= N_z, \\
    X &:= f_a(\mathbf{C}^\top Z)  + N_x, \\
    Y &:= u \mathbf{b}^\top f_a(\Gamma^\top X) + v f_a(\mathbf{D}^\top Z) + w N_y.
\end{aligned}
\end{align}


\paragraph{Causal effect representation}

\begin{figure*}
    \centering
    \includegraphics[width=0.85\linewidth]{uai2025-template/figures/DR_noise_behavior_uq.png}
    \caption{Correlation between $\mathbf{w}^\top Y$ and $\phi(X)$ as $d$ increases. $T_D$ consistently outperforms all other methods, recovering $\phi(X)$ as $d$ grows, provided that $\mathbf{b}$ grows faster than $\mathbf{\Sigma}$. Columns are indexed by A, B, C, D and rows by $1, 2, 3, 4$.}
    \label{fig:DR_noise_behavior_Noise}
\end{figure*}

We next compare how the different learning algorithms behave under different noise structures. In the setting Strong\_N\_Y low\_rank (Fig. \ref{fig:DR}), the increase in performance of $T_F$ and $T_D$ is due to the low-rank structure of the noise. The overall better performance of $T_D$ over $T_F$ and pCCA is due to the correlation between $X$ and $Z$.
\begin{figure*}
    \centering
    \includegraphics[width=0.75\linewidth]{uai2025-template/figures/DR_uq.png}
    \caption{Experiments with different noise structures ($\mathbf{\Sigma}$ being diagonal, full rank and low rank) and scaling factors ($(u, v, w)$ as $(1/3, 1/3, 1/3)$, $(0.1, 0.1, 0.8)$, and $(0.1, 0.8, 0.1)$ for equal, Strong\_N\_Y and Strong\_Z, respectively). Overall, the learning algorithm $T_D$ performs better and tends to converge.}
    \label{fig:DR}
\end{figure*}


To better understand the behaviour of our algorithms, we conducted experiments in four additional settings:

\begin{enumerate}
    \item \textbf{High-dimensional setting:} We conducted a similar experiment under different conditions on $\mathbf{b}$ and $\mathbf{\Sigma}$, increasing the dimensionality. In this case, however, we significantly reduced the sample size to $n=100$, such that as $d$ grows, we obtain $n<d$.
    \item \textbf{Nonlinear setting:} Again, we conducted a similar experiment with different conditions on $\mathbf{b}$ and $\mathbf{\Sigma}$. Here, however, we applied the nonlinear mapping $f_a(z):= \exp(-z^2/2)\sin(az)$ with $a \in \{1, 2, 3\}$. We use a random forest algorithm with 100 trees as an estimator of the conditional expectation.
    \item \textbf{$X$ independent of $Z$ setting:} We conducted experiments to clarify the discrepancy in performance between pCCA and $T_D$ by generating the data such that $X$ and $Z$ are independent.
    \item \textbf{Non-Gaussian noise:} We conducted an experiment similar to the original experiment proposed in Section \ref{sec:Experiments}, but with uniform and exponential noise.
\end{enumerate}


In the high-dimensional setting, as shown in Fig. \ref{fig:DR_high_dimensional}, we observe results that are very similar to those in the large-sample setting, with one key difference: performance drops to near zero when $\mathbf{\Sigma}$ increases rapidly with $d$ (row 1) and $d > n$, and when $\mathbf{b}$ grows too slowly compared to $\sigma$ ($\sigma=[1, 1/d]$ and $\mathbf{b}=[1/d, 1/d^2]$). It would be interesting to further evaluate how the regularisation parameter (see Section \ref{sec:stability}) might improve performance in this specific case. However, in this setting, ensuring algorithmic convergence is particularly challenging, as the signal strength is constrained by the number of available samples. As a result, the SNR is unlikely to grow without bound unless the inherent noise in the data is minimal.




\begin{figure*}
    \centering
    \includegraphics[width=0.85\linewidth]{uai2025-template/figures/DR_noise_behavior_high_dim_uq.png}
    \caption{High-dimensional experiment using $100$ samples for training. As observed in the last row, all algorithms fail to recover the signal when the noise variance increases rapidly and the number of samples falls below the outcome dimension.}
    \label{fig:DR_high_dimensional}
\end{figure*}

Interestingly, in the nonlinear setting (Fig. \ref{fig:DR_nonlinear}), the learning algorithm $T_D$ is still able to recover $\phi(X)$ in most cases. In contrast, the other algorithms show greater difficulty in achieving convergence under these conditions. This highlights the potential of $T_D$ to effectively recover direct effects, even in complex nonlinear settings.

\begin{figure*}
    \centering
    \includegraphics[width=0.85\linewidth]{uai2025-template/figures/DR_NL_clean_renamed.jpg}
    \caption{Experiments with the nonlinear map $f_a$ and with different noise structures ($\mathbf{\Sigma}$ being diagonal, full rank and low rank) and scaling factors ($(u, v, w)$ as $(1/3, 1/3, 1/3)$, $(0.1, 0.1, 0.8)$, and $(0.1, 0.8, 0.1)$ for equal, Strong\_N\_Y and Strong\_Z, respectively). Overall, the learning algorithm $T_D$ performs better and tends to converge.}
    \label{fig:DR_nonlinear}
\end{figure*}

We also conducted an experiment where $X$ and $Z$ were generated as independent variables to highlight the potential advantages of our learning algorithms. In this case, partial CCA (pCCA) recovers the latent structure similarly to the other algorithms (Fig. \ref{fig:DR_noise_behavior_indep}). This result highlights the robustness of $T_D$, as it remains stable across different structural relationships between $X$ and $Z$, reinforcing its applicability in both confounded and mediated settings.


\begin{figure*}
    \centering
    \includegraphics[width=0.85\linewidth]{uai2025-template/figures/DR_noise_behavior_X_Z_indep.png}
    \caption{Experiment where $X \indep Z$. When the dependence between $X$ and $Z$ is removed, the pCCA algorithm performs similarly to $T_D$. This highlights the effectiveness of our approach in scenarios with confounding or mediation effects.}

    \label{fig:DR_noise_behavior_indep}
\end{figure*}

The experiments with both exponential and uniform noise, shown in Figure~\ref{fig:uniform_exponential}, demonstrate that the convergence behavior of the algorithms remains consistent across Gaussian, exponential, and uniform noise types. This aligns with our theoretical convergence guarantees presented in Section~\ref{sec:latent_structure_recovery}, which are noise-agnostic by design.

\begin{figure}[h]
    \centering
    \begin{minipage}{0.45\textwidth}
        \centering
        \includegraphics[width=\textwidth]{uai2025-template/figures/DR_noise_behavior_uniform_uq.png}
    \end{minipage}
    \hfill
    \begin{minipage}{0.45\textwidth}
        \centering
        \includegraphics[width=\textwidth]{uai2025-template/figures/DR_noise_behavior_exponential_UQ.png}
    \end{minipage}
    \caption{Uniform noise (left) and exponential noise (right).}
    \label{fig:uniform_exponential}
\end{figure}


\paragraph{Hypothesis testing}

\begin{figure}
    \centering
    \includegraphics[width=0.9\linewidth]{uai2025-template/figures/Power_tests_all.png}
    \caption{Power of the different methods. $T_D$, $T_F$, and pCCA generally exhibit better performance compared to the other approaches. This is partly because they rely on linear Gaussian models, which constrain the alternative hypotheses, thereby improving the power of these tests.}

    \label{fig:power_all}
\end{figure}


\begin{figure}
    \centering
    \includegraphics[width=0.9\linewidth]{uai2025-template/figures/type_I_error_control_tests_all.png}
    \caption{Type I error control. All methods control the Type I error well except the Fisher $Z$ test, which has poor control in low-sample, high-dimensional settings.}
    \label{fig:type_I_control}
\end{figure}




For a test to be valid, it must control the Type I error rate. Specifically, if we test at level $\alpha$, then under the null hypothesis $H_0$, the probability of rejecting $H_0$ should be less than or equal to $\alpha$. In Fig. \ref{fig:type_I_control}, we observe that all methods control the Type I error reasonably accurately, with the exception of the Fisher $Z$ test in low-sample, high-dimensional settings.
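
The validity requirement can be written compactly, together with the empirical quantity that a Type I error plot typically reports. This is a sketch only: the number of simulation repetitions $R$, the per-repetition $p$-values $p_r$, and the estimate $\hat{\alpha}$ are hypothetical notation introduced here, and the exact procedure used for the figure is not specified above.
\[
\mathbb{P}_{H_0}\left(\text{reject } H_0\right) \leq \alpha,
\qquad
\hat{\alpha} = \frac{1}{R}\sum_{r=1}^{R} \mathbf{1}\{p_r \leq \alpha\},
\]
where each $p_r$ is computed on data simulated under $H_0$. A test controls the Type I error empirically when $\hat{\alpha}$ stays at or below $\alpha$ up to Monte Carlo error.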


\vfill

\newpage


\subsection{Real-World experiments}

\paragraph{Separating internal climate variability from the externally forced response.}

\begin{figure}[ht]
    \centering
    \begin{subfigure}{0.48\linewidth}
        \centering
        \includegraphics[width=\linewidth]{uai2025-template/figures/trends_mse_forced.png}
        \caption{Forced response trends}
    \end{subfigure}
    \hfill
    \begin{subfigure}{0.48\linewidth}
        \centering
        \includegraphics[width=\linewidth]{uai2025-template/figures/trends_mse_internal.png}
        \caption{Internal variability trends}
    \end{subfigure}
    \caption{Mean Squared Error (MSE) of different algorithms in reconstructing (a) forced response trends and (b) internal climate variability trends. The Direct Effect Analysis (DEA) algorithm, using the basis $(\mathbf{\Sigma}^{-1}\mathbf{b}, \mathbf{b}^\perp)$, i.e. the $T_D$ algorithm, is compared to Detrending and Dynamical Adjustment (the two most common approaches for separating internal from external climate variability). Overall, DEA and Detrending perform better than Dynamical Adjustment. DEA outperforms Detrending for internal variability trend estimation but has a higher median MSE for forced trend reconstruction. However, DEA provides better worst-case control in this case.}
    \label{fig:mse_boxplot}
\end{figure}


This experiment aims to assess the performance of our learning algorithms in disentangling internal climate variability from the forced response to external factors, such as greenhouse gas (GHG) emissions or solar radiation. For this analysis, we focus on temperature fields. We use $M = 50$ members from the CESM2 historical climate simulations \citep{danabasoglu2020community}, covering 1880 to 2014. The variables under consideration are Sea Level Pressure (SLP) and Temperature (T), with monthly data yielding $1669$ samples per member. The detrended SLP field is treated as a proxy for internal variability ($Z \in \mathbb{R}^{648}$), while the temperature field ($Y \in \mathbb{R}^{648}$) serves as the response variable of interest. The temperature for member $i$, at location $j$, and time $t$, is denoted by $Y^{(i)}_j(t)$.

As a proxy for climate external forcing, we use a smoothed version (5-year moving average) of the Global Mean Temperature (GMT), which is computed as a spatial average of the temperature field ($X \in \mathbb{R}$):
\begin{equation}
    X(t) = \frac{1}{\text{years} \times 12} \sum_{\tau=1}^{\text{years} \times 12} \frac{1}{d} \sum_{j=1}^d Y_j(t - \tau)
\end{equation}
where $d$ represents the number of spatial locations and the sum over $\tau$ spans the 60 months of the 5-year smoothing window.
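
For concreteness, the definition can be instantiated with the values stated in the text, namely the 5-year window (60 monthly steps) and the $d = 648$ grid points of the temperature field. As we read the formula, the average is trailing, so that $X(t)$ only involves months strictly before $t$:
\[
X(t) = \frac{1}{60} \sum_{\tau=1}^{60} \frac{1}{648} \sum_{j=1}^{648} Y_j(t - \tau).
\]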

The climate-forced response, $Y_{\text{forced}}$, is calculated as the ensemble mean over all simulation members, $Y^{(i)}$:
\begin{equation}\label{eq:intern_forced}
    Y_{\text{forced}, j}(t) = \frac{1}{M} \sum_{i=1}^M Y^{(i)}_j(t) \quad \text{and} \quad Y_{\text{internal}, j}^{(i)}(t) = Y^{(i)}_j(t) - Y_{\text{forced}, j}(t),
\end{equation}
where $Y_{\text{internal}}$ represents the true internal variability of $Y$ after removing the climate-forced component.
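
A direct consequence of Eq.~\eqref{eq:intern_forced}, spelled out here as a sanity check, is that the internal components average to zero across the ensemble at every location and time step:
\[
\frac{1}{M}\sum_{i=1}^M Y_{\text{internal}, j}^{(i)}(t)
= \frac{1}{M}\sum_{i=1}^M Y^{(i)}_j(t) - Y_{\text{forced}, j}(t)
= 0.
\]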

The Direct Effect Analysis (DEA) algorithm (employing $T_D$) is applied as follows. We train DEA using the triplets $\{\mathrm{GMT}(t), T(t), \mathrm{SLP}(t)\}_{t=1}^{n}$ as realisations of $(X, Y, Z)$, where $n$ is the number of monthly time steps, $\mathrm{GMT}$ serves as the climate external forcing proxy ($X$), $T$ represents the temperature response variable ($Y$), and $\mathrm{SLP}$ is a proxy for the internal climate variability ($Z$).
Once the model is trained, we project the data onto the null space of the vector $\mathbf{b}$, denoted as $\mathbf{b}^\perp$, to recover the internal variability component $\hat{Y}_{\text{internal}}$. This projection isolates the portion of the temperature field that is not correlated with the external forcing, allowing us to separate the forced and internal components effectively. Finally, we compute the climate-forced response as $\hat{Y}_{\text{forced}} = Y - \hat{Y}_{\text{internal}}$, which provides an estimate of the temperature response attributed to external forcing alone.
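
One way to make the projection step concrete is through the orthogonal projector onto $\mathbf{b}^\perp$. This is a sketch under the assumption that the projection is orthogonal; the text does not specify this, and an oblique projection along another direction would lead to a different formula. The matrix $\mathbf{P}_{\mathbf{b}^\perp}$ is notation introduced only for this illustration:
\[
\mathbf{P}_{\mathbf{b}^\perp} = \mathbf{I} - \frac{\mathbf{b}\mathbf{b}^\top}{\mathbf{b}^\top\mathbf{b}},
\qquad
\hat{Y}_{\text{internal}} = \mathbf{P}_{\mathbf{b}^\perp} Y,
\qquad
\hat{Y}_{\text{forced}} = Y - \hat{Y}_{\text{internal}} = \frac{\mathbf{b}^\top Y}{\mathbf{b}^\top\mathbf{b}}\,\mathbf{b}.
\]
Under this assumption, $\mathbf{b}^\top \hat{Y}_{\text{internal}} = 0$ holds by construction, and the two estimated components sum to $Y$.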


We compare our learning algorithm with two common approaches used in climate science:

\begin{enumerate}
    \item \textbf{Detrending:} A simple linear model predicts $Y$ from $X$, providing an estimate of the climate-forced response, $\hat{Y}_{\text{forced}}$, and the dynamical component as $\hat{Y}_{\text{internal}} = Y - \hat{Y}_{\text{forced}}$.
    \item \textbf{Dynamical Adjustment} \citep{Sippel2019}: A model is trained using both $X$ (GMT) and $Z$ (SLP). Predictions are made by setting $Z$ to zero, isolating the dynamical component.
\end{enumerate}
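
As an illustration of the Detrending baseline, one possible reading is a per-location ordinary least-squares fit with an intercept. This is only an assumption (the description above specifies no more than a simple linear model), and the coefficients $\hat{a}_j$ and $\hat{c}_j$ are hypothetical notation introduced for this sketch:
\[
\hat{Y}_{\text{forced}, j}(t) = \hat{a}_j + \hat{c}_j X(t),
\qquad
\hat{c}_j = \frac{\sum_{t} (X(t) - \bar{X})(Y_j(t) - \bar{Y}_j)}{\sum_{t} (X(t) - \bar{X})^2},
\qquad
\hat{a}_j = \bar{Y}_j - \hat{c}_j \bar{X},
\]
where $\bar{X}$ and $\bar{Y}_j$ denote time averages over the training period.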

We compute trends for the internal and forced components ($Y_{\text{internal}}$, $\hat{Y}_{\text{internal}}$, $Y_{\text{forced}}$, and $\hat{Y}_{\text{forced}}$) over 20-year periods. The performance of the methods is evaluated using the following metrics:

\begin{itemize}
    \item \textbf{20-year trends MSEs:} MSEs for the trends of the three methods are shown in Fig. \ref{fig:mse_boxplot} (a) for forced trends and (b) for internal variability trends. The boxplots display the MSE distributions across different simulation members.
    \item \textbf{20-year trends maps (internal variability):} Internal variability trends are compared spatially in Fig. \ref{fig:trends_maps_DEA} for DEA and Fig. \ref{fig:trends_maps_Detrending} for Detrending to better understand model biases.
    \item \textbf{20-year trends time series (forced response):} Forced response trends are compared over time, with time series for DEA and Detrending plotted for randomly selected locations in Fig. \ref{fig:climate_experiment_forced_response_trends_TS}.
\end{itemize}
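
The trend computation is not spelled out above. A common choice, taken here purely as an assumption, is the least-squares slope over the window. With $T$ time steps in a 20-year window ($T = 240$ for monthly data), and with $s_j$ and $\hat{s}_j$ introduced as hypothetical notation for the slopes of the true and estimated component at location $j$, the trend and the resulting error can be sketched as:
\[
s_j = \frac{\sum_{t=1}^{T} (t - \bar{t})\,(Y_j(t) - \bar{Y}_j)}{\sum_{t=1}^{T} (t - \bar{t})^2},
\qquad
\mathrm{MSE} = \frac{1}{d}\sum_{j=1}^{d} (\hat{s}_j - s_j)^2,
\]
where $\bar{t}$ and $\bar{Y}_j$ are averages over the window and $\hat{s}_j$ is obtained by applying the same slope formula to the estimated component.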

We train the algorithms (DEA, Detrending, Dynamical Adjustment), extract latent structures $\hat{Y}_{\text{forced}} = \mathbf{w}^\top Y$, and compare the trends of the last 20 years of $\hat{Y}_{\text{forced}}$ and $Y_{\text{forced}}$. Figure \ref{fig:mse_boxplot} shows that DEA performs similarly to Detrending but outperforms Dynamical Adjustment.

\begin{figure}
    \centering
    \includegraphics[width=1\linewidth]{uai2025-template/figures/estimated_trends_DEA.png}
    \caption{Trends of the reconstructed internal climate variability using the DEA algorithm. The algorithm captures general warming and cooling patterns but underestimates trends in the North Pole and overestimates them in Western America.}
    \label{fig:trends_maps_DEA}
\end{figure}

\begin{figure}
    \centering
    \includegraphics[width=1\linewidth]{uai2025-template/figures/estimated_trends_Detrending.png}
    \caption{Trends of the reconstructed internal climate variability using the Detrending algorithm. The algorithm captures general warming and cooling patterns but underestimates trends in the poles and overestimates trends in Western America and Indonesia.}
    \label{fig:trends_maps_Detrending}
\end{figure}

A qualitative evaluation of the trend maps generated by DEA (Figure \ref{fig:trends_maps_DEA}) shows that the algorithm captures the warming and cooling patterns. However, both DEA and Detrending tend to underestimate trends in polar regions where temperature trends are generally stronger.

\begin{figure}
    \centering
    \includegraphics[width=0.95\linewidth]{uai2025-template/figures/trends_temporal_DEA_vs_Detrending.png}
    \caption{Comparison of original observations $Y$ and the reconstructed climate-forced response $\hat{Y}_{\text{forced}}$ at 16 randomly selected locations for both DEA and Detrending.}
    \label{fig:climate_experiment_forced_response_trends_TS}
\end{figure}

Figure~\ref{fig:climate_experiment_forced_response_trends_TS} shows that both Detrending and DEA effectively capture the forced response trends. However, in regions where the forced response exhibits high variability (e.g., at grid points $d=30$ or $d=493$, typically located in polar regions), both methods struggle to fully capture this variability. This may be due to the smoothing of GMT in the external forcing, but the observation warrants further investigation, as these regions of high variability may also reflect model artefacts.
