\documentclass[twoside]{article}

\usepackage{aistats2025}
% If your paper is accepted, change the options for the package
% aistats2025 as follows:
%
%\usepackage[accepted]{aistats2025}
%
% This option will print headings for the title of your paper and
% headings for the authors names, plus a copyright note at the end of
% the first column of the first page.

% If you set papersize explicitly, activate the following three lines:
%\special{papersize = 8.5in, 11in}
%\setlength{\pdfpageheight}{11in}
%\setlength{\pdfpagewidth}{8.5in}

% If you use natbib package, activate the following three lines:
\usepackage[round]{natbib}
\renewcommand{\bibname}{References}
\renewcommand{\bibsection}{\subsubsection*{\bibname}}

% If you use BibTeX in apalike style, activate the following line:
%\bibliographystyle{apalike}

\usepackage{graphicx}
\usepackage{amsfonts}
\usepackage{amsmath,bm}
\usepackage{amsthm}
\usepackage{cleveref}

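\newcommand{\YIZbdX}{{\hspace{-.5pt}Y \hspace{-.5pt} | \hspace{-.5pt} \bm{Z}\hspace{-1pt}, \hspace{-0.5pt} \text{do}(X)}}
% Marginal counterpart of \YIZbdX, used in the Gaussian-copula section below.
\providecommand{\YIdX}{{\hspace{-.5pt}Y \hspace{-.5pt} | \hspace{-.5pt} \text{do}(X)}}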

% If you use BibTeX in apalike style, activate the following line:
\bibliographystyle{apalike}

\newtheorem{theorem}{Theorem}
\newtheorem{definition}{Definition}
\newtheorem{lemma}[theorem]{Lemma}

\begin{document}
% If your paper is accepted and the title of your paper is very long,
% the style will print as headings an error message. Use the following
% command to supply a shorter title of your paper so that it can be
% used as headings.
%
%\runningtitle{I use this title instead because the last one was very long}

% If your paper is accepted and the number of authors is large, the
% style will print as headings an error message. Use the following
% command to supply a shorter version of the authors names so that
% they can be used as headings (for example, use only the surnames)
%
%\runningauthor{Surname 1, Surname 2, Surname 3, ...., Surname n}

% Supplementary material: To improve readability, you must use a single-column format for the supplementary material.

\onecolumn
\aistatstitle{Instructions for Paper Submissions to AISTATS 2025: \\
Supplementary Materials}

\section{COPULA BACKGROUND}\label{app:copulas}
% \vspace{-1cm}

Copulas are a powerful tool for modelling joint dependence independently of the univariate margins. This aligns well with the requirements of the Frugal Parameterisation, where dependencies need to be varied without altering specified margins (the most critical being the specified causal effect). Understanding the constraints and limitations of copula models helps ensure that causal models remain accurate and consistent with the intended parameterisation.

\subsection{SKLAR'S THEOREM}
Sklar's theorem \citep{sklar1959,czado2019analyzing} underpins copula modelling by providing a bridge between multivariate joint distributions and their univariate margins. It allows one to separate the marginal behaviour of each variable from their joint dependence structure, with the latter being the copula itself.

\begin{theorem}
For a $d$-variate distribution function $F_{1:d} \in \mathcal{F}(F_1,\ldots,F_d)$ with $j^{\text{th}}$ univariate margin $F_j$, the copula associated with $F_{1:d}$ is a distribution function $C : [0,1]^d \rightarrow [0,1]$ with uniform margins on $(0,1)$ that satisfies
\begin{equation*}
    F_{1:d}(\bm{y}) = C(F_1(y_1),\dots,F_{d}(y_d)), \quad \bm{y} \in \mathbb{R}^{d}.
\end{equation*}
\begin{enumerate}
    \item If $F_{1:d}$ is a continuous $d$-variate distribution function with univariate margins $F_1,\dots, F_d$ and quantile functions $F^{-1}_1,\dots, F^{-1}_d$, then $C$ is unique and given by
    \begin{equation*}
        C(\bm{u}) = F_{1:d}(F^{-1}_1(u_1),\dots,F^{-1}_d(u_d)), \quad \bm{u}\in[0,1]^d.
    \end{equation*}
    \item If $F_{1:d}$ is a $d$-variate distribution function of discrete random variables (more generally, partly continuous and partly discrete), then the copula is unique only on the set
    \begin{equation*}
        \operatorname{Range}(F_1) \times \dots \times \operatorname{Range}(F_d).
    \end{equation*}
\end{enumerate}
When $F_{1:d}$ admits a density $f$, the copula is linked to its density $c(\cdot)$ through
\begin{equation*}
    f(\bm{y}) = c(F_1(y_1),\dots, F_d(y_d))\cdot f_1(y_1)\cdots f_d(y_d),
\end{equation*}
where $f_j(\cdot)$ is the univariate density function of the $j^{\text{th}}$ variable.
\end{theorem}

Note that Sklar's theorem explicitly uses the \textbf{univariate marginals} of the variable set $\{Y_1,\dots, Y_d\}$ to convert between the copula $C(\bm{u})$, which is a joint distribution on the uniform scale, and the original distribution $F_{1:d}(\bm{y})$. For absolutely continuous random variables, the copula function $C$ is unique. This uniqueness no longer holds for discrete variables, but this does not severely limit the applicability of copulas to simulating from discrete distributions.

Equivalently, from an analytic viewpoint, a function $C: [0, 1]^d \rightarrow [0, 1]$ is a $d$-dimensional copula if it has the following properties:
\begin{enumerate}
    \item $C(u_1,\dots, 0, \dots, u_d) = 0$, that is, $C$ vanishes whenever at least one argument is zero.
    \item $C(1, \dots, 1, u_i, 1, \dots, 1) = u_i$.
    \item $C$ is $d$-non-decreasing.
\end{enumerate}
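\begin{definition}
    A function $C: [0,1]^d \rightarrow [0,1]$ is $d$-non-decreasing if, for any hyperrectangle $H=\prod_{i=1}^{d}\left[u_i, y_i \right]\subseteq [0,1]^{d}$, the $C$-volume of $H$ is non-negative:
    \begin{equation*}
        V_C(H) = \sum_{\bm{v} \in \prod_{i=1}^{d}\{u_i, y_i\}} (-1)^{N(\bm{v})}\, C(\bm{v}) \geq 0,
    \end{equation*}
    where the sum runs over the vertices of $H$ and $N(\bm{v}) = |\{i : v_i = u_i\}|$ counts the coordinates of $\bm{v}$ that sit at their lower endpoint. When $C$ has a density $c$, this volume equals $\int_{H}c(\bm{u})~d\bm{u}$.
\end{definition}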
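
As a small worked illustration (a sketch for $d=2$; the independence copula $C(a,b)=ab$ is chosen here purely for exposition), the $C$-volume of $H=[u_1,y_1]\times[u_2,y_2]$, written $V_C(H)$, is the alternating sum of $C$ over the four vertices of $H$:
\begin{equation*}
    V_C(H) = C(y_1,y_2) - C(u_1,y_2) - C(y_1,u_2) + C(u_1,u_2).
\end{equation*}
For $C(a,b)=ab$ this equals $(y_1-u_1)(y_2-u_2) \geq 0$, which is the probability that two independent uniform variables fall in $H$.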

\pagebreak  %%%%%%%%%%%%%%%%%% NEED TO KEEP THIS IN ORDER FOR SUPP MATERIAL TO BE RENDERED WELL.

\subsection{COPULAS FOR DISCRETE VARIABLES}\label{appsub:discrete-copulas}

\subsubsection{CHALLENGES AND MOTIVATIONS}\label{subsubsec:discrete-copula}
Modelling the dependence among discrete and mixed data is particularly challenging because copulas for discrete variables are not unique. Additionally, copulas encode an ordering in the joint distribution, since probability integral transforms are inherently rank-based; they should therefore only be used for count or ordinal data. We use the approach suggested by \citet{ruschendorf2009distributional}, which is outlined in \Cref{appsub:distribtional-transform}.

\subsubsection{EMPIRICAL COPULA PROCESSES FOR DISCRETE VARIABLES}\label{appsub:distribtional-transform}
To deal with discrete variables, we use the Generalised Distributional Transform of a random variable, originally proposed by \citet{ruschendorf2009distributional}. We quote the main result from \citet{ruschendorf2009distributional} below.

\begin{theorem}
On a probability space $(\Omega, \mathcal{A}, P)$ let $X$ be a real random variable with distribution function $F$ and let $V \sim U(0, 1)$ be uniformly distributed on $(0, 1)$ and independent of $X$. The \textit{modified distribution function} $F(x, \lambda)$ is defined by
\begin{equation*}
F(x, \lambda) := P(X < x) + \lambda P(X = x).
\end{equation*}
We define the (generalised) \textit{distributional transform} of $X$ by
\begin{equation*}
U := F(X, V).
\end{equation*}
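An equivalent representation of the distributional transform is
\begin{equation*}
U = F(X-) + V(F(X) - F(X-)).
\end{equation*}
Then $U \sim U(0, 1)$ and $X = F^{-1}(U)$ almost surely, where $F^{-1}$ denotes the generalised inverse (quantile function) of $F$.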
\end{theorem}
\citet{ruschendorf2009distributional} remarks that the generalised transform is not unique for discrete variables. Such a dequantisation step may introduce artificial local dependence, which may lead to an incorrect flow being inferred and thereby hinder inference of the causal margin.
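
As a brief illustration of the transform (a sketch assuming $X \sim \operatorname{Bernoulli}(p)$ with $p \in (0,1)$, a choice made here only for exposition), we have $F(0-)=0$, $F(0)=F(1-)=1-p$ and $F(1)=1$, so that
\begin{equation*}
    U = \begin{cases} V(1-p), & X = 0, \\ (1-p) + Vp, & X = 1. \end{cases}
\end{equation*}
Hence, for $u \in (0,1)$,
\begin{equation*}
    P(U \leq u) = (1-p)\min\left\{\frac{u}{1-p}, 1\right\} + p\max\left\{\frac{u-(1-p)}{p}, 0\right\} = \min\{u, 1-p\} + \max\{u-(1-p), 0\} = u,
\end{equation*}
so $U$ is uniform on $(0,1)$ even though $X$ is discrete.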

\section{GAUSSIAN COPULA WITH GAUSSIAN MARGINS}
\label{sec:gaussian}
In the main text, we generate synthetic data from a Gaussian copula with univariate Gaussian margins. The resultant joint multivariate density is a multivariate Gaussian. Consequently, any univariate density conditioned on all the other variables will be Gaussian, and the conditional mean is a linear function of the conditioning variables. The proof for the latter can be found in \citet{bishop2006pattern}. The proof for the former is provided below.

\begin{theorem}
    Let $\{Y_d\}_{d=1}^{D}$ be a set of $D$ univariate Gaussian random variables, where each $Y_d \sim \mathcal{N}(\mu_d, \sigma_d^2)$, and let their dependence be described by a multivariate Gaussian copula with density $c(F_1(y_1), \dots, F_D(y_D))$, parameterized by a correlation matrix $\bm{R}$. Then the joint distribution of the random vector $\bm{Y} = (Y_1, Y_2, \dots, Y_D)^{T}$ is multivariate normal, specifically:
    \[
        \bm{Y} \sim \mathcal{N}(\bm{\mu}, \Sigma \bm{R} \Sigma),
    \]
    where $\bm{\mu} = (\mu_1, \mu_2, \dots, \mu_D)^{T}$ is the mean vector, and $\Sigma$ is a $D \times D$ diagonal matrix with $\Sigma_{ii} = \sigma_i$ for $i = 1, \dots, D$, and $\Sigma_{ij} = 0$ for $i \neq j$.
\end{theorem}

\begin{proof}
    Consider a Gaussian copula with univariate Gaussian marginals $\{Y_1, \dots, Y_D\}$, where each $Y_d \sim \mathcal{N}(\mu_d, \sigma_d^2)$. Let the copula distribution function $C(\bm{u})$ be given by
    \[
    C(\bm{u}) = \Phi_D(\Phi^{-1}(u_1), \dots, \Phi^{-1}(u_D) \mid \bm{0}, \bm{R}),
    \]
    where $\Phi_D(\cdot \mid \bm{0}, \bm{R})$ is the CDF of the $D$-dimensional normal distribution with zero mean and correlation matrix $\bm{R}$, and $\Phi(\cdot)$ is the CDF of the standard normal distribution. By Sklar's theorem, the joint density of $\bm{Y}$ is:
    \[
    f(\bm{y}) = c(F_1(y_1), \dots, F_D(y_D)) \prod_{d=1}^{D} f_d(y_d),
    \]
    where $f_d(y_d)$ is the density of $Y_d \sim \mathcal{N}(\mu_d, \sigma_d^2)$ and $c(F_1(y_1), \dots, F_D(y_D))$ is the copula density. To compute the copula density, we differentiate $C(\bm{u})$ with respect to $u_1, \dots, u_D$:
    \[
    c(\bm{u}) = \frac{\partial^{D} C(\bm{u})}{\partial u_1 \cdots \partial u_D}.
    \]
    Using the Gaussian copula formula, we obtain:
    \[
    c(\bm{u}) = \frac{\phi_D(\Phi^{-1}(u_1), \dots, \Phi^{-1}(u_D) \mid \bm{0}, \bm{R})}{\prod_{d=1}^{D} \phi(\Phi^{-1}(u_d))},
    \]
    where $\phi_D(\cdot \mid \bm{0}, \bm{R})$ is the PDF of the multivariate normal distribution with mean zero and correlation matrix $\bm{R}$, and $\phi(\cdot)$ is the standard univariate normal PDF.

    Next, recall that for any Gaussian random variable $Y_d \sim \mathcal{N}(\mu_d, \sigma_d^2)$, we have:
    \[
    u_d = F_d(y_d) = \Phi\left( \frac{y_d - \mu_d}{\sigma_d} \right).
    \]
    By Lemma \ref{lemma:gaussian-cdf} (below), the inverse CDF of the standard normal, $\Phi^{-1}(u_d)$, satisfies:
    \[
    \Phi^{-1}(F_d(y_d)) = \frac{y_d - \mu_d}{\sigma_d}.
    \]
    Therefore, substituting into the copula density, we get:
    \[
    c(F_1(y_1), \dots, F_D(y_D)) = \frac{\phi_D\left( \frac{y_1 - \mu_1}{\sigma_1}, \dots, \frac{y_D - \mu_D}{\sigma_D} \mid \bm{0}, \bm{R} \right)}{\prod_{d=1}^{D} \phi\left( \frac{y_d - \mu_d}{\sigma_d} \right)}.
    \]

    Now, combining this with the marginal densities $f_d(y_d) = \frac{1}{\sigma_d} \phi\left( \frac{y_d - \mu_d}{\sigma_d} \right)$, the univariate factors $\phi(\cdot)$ cancel and we obtain the joint density:
    \[
    f(\bm{y}) = \phi_D\left( \frac{y_1 - \mu_1}{\sigma_1}, \dots, \frac{y_D - \mu_D}{\sigma_D} \mid \bm{0}, \bm{R} \right) \prod_{d=1}^{D} \frac{1}{\sigma_d}.
    \]
    Finally, writing $\bm{z} = \Sigma^{-1}(\bm{y} - \bm{\mu})$, we have $\bm{z}^{T} \bm{R}^{-1} \bm{z} = (\bm{y} - \bm{\mu})^{T} (\Sigma \bm{R} \Sigma)^{-1} (\bm{y} - \bm{\mu})$ and $|\bm{R}|^{-1/2} \prod_{d=1}^{D} \sigma_d^{-1} = |\Sigma \bm{R} \Sigma|^{-1/2}$, so that
    \[
    f(\bm{y}) = \phi_D(\bm{y} \mid \bm{\mu}, \Sigma \bm{R} \Sigma),
    \]
    which is the PDF of a multivariate Gaussian distribution with mean vector $\bm{\mu}$ and covariance matrix $\Sigma \bm{R} \Sigma$. Hence, the joint distribution of $\bm{Y}$ is multivariate normal, as desired.
\end{proof}

\begin{lemma}\label{lemma:gaussian-cdf}
    Let $\Phi^{-1}(\cdot)$ denote the inverse CDF of the standard normal distribution. For a Gaussian random variable $X \sim \mathcal{N}(\mu, \sigma^2)$, we have:
    \[
    \Phi^{-1}(F_X(x)) = \frac{x - \mu}{\sigma}.
    \]
\end{lemma}

\begin{proof}
    This follows by noting that $F_X(x) = \Phi\left( \frac{x - \mu}{\sigma} \right)$, and thus:
    \[
    \Phi^{-1}(F_X(x)) = \Phi^{-1}\left( \Phi\left( \frac{x - \mu}{\sigma} \right) \right) = \frac{x - \mu}{\sigma}.
    \]
\end{proof}
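
As a concrete illustration of the result (a sketch for the bivariate case $D=2$; the correlation parameter $\rho$ is notation introduced here for exposition only), take $\bm{R}$ with off-diagonal entry $\rho \in (-1,1)$. Then
\begin{equation*}
    \Sigma \bm{R} \Sigma =
    \begin{pmatrix} \sigma_1 & 0 \\ 0 & \sigma_2 \end{pmatrix}
    \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix}
    \begin{pmatrix} \sigma_1 & 0 \\ 0 & \sigma_2 \end{pmatrix}
    =
    \begin{pmatrix} \sigma_1^2 & \rho\sigma_1\sigma_2 \\ \rho\sigma_1\sigma_2 & \sigma_2^2 \end{pmatrix},
\end{equation*}
and standard Gaussian conditioning gives
\begin{equation*}
    Y_1 \mid Y_2 = y_2 \sim \mathcal{N}\left( \mu_1 + \rho\frac{\sigma_1}{\sigma_2}(y_2 - \mu_2),~ \sigma_1^2(1 - \rho^2) \right),
\end{equation*}
whose mean is linear in the conditioning variable, in line with the linear-Gaussian conditional noted at the start of this section.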


\section{DERIVING UNIFORMLY MARGINAL RANKS USING A GAUSSIAN COPULA}

In this section we outline the circumstances under which two different sets of marginal covariate distributions may yield the same marginal causal effect density, assuming that $\hat{c}_{\YIZbdX}$ is a conditional copula density derived from a Gaussian copula. We emphasize that this is a rather strict scenario that is unlikely to occur in real-world settings.

We assume that the ranks of the marginal causal effect are distributed as follows:
\begin{align}
\label{eq:cond-gauss-cop}
    \Phi^{-1}(u_{\YIdX}) \mid \Phi^{-1}(u_{Z_1}), \dots, \Phi^{-1}(u_{Z_D}) \sim \mathcal{N}\left( \sum_{d=1}^{D} \beta_{d}\Phi^{-1}(u_{Z_d}),~1 - \sum_{d=1}^{D}\beta_{d}^{2}\right),
\end{align}
which ensures that the marginal distribution of $\Phi^{-1}(u_{\YIdX})$ is $\mathcal{N}(0,1)$ if $\{\Phi^{-1}(u_{Z_d})\}_{d=1}^{D}$ are independent standard Gaussian variables. Given \Cref{eq:cond-gauss-cop}, our question is whether there is another set of conditioning variables which yields the same marginal outcome of the conditional model.

We can rewrite \Cref{eq:cond-gauss-cop} as a linear combination of Gaussians:
\begin{align}
    \Phi^{-1}(u_{\YIdX}) = \sum_{d=1}^{D} \beta_{d} T_{d} + \epsilon
\end{align}
where $\epsilon \sim \mathcal{N}\left(0, 1 - \sum_{d=1}^{D}\beta_{d}^{2}\right)$, and $\{T_{d}\}_{d=1}^D$ is an arbitrary set of conditioning variables. If the marginal distribution of $\Phi^{-1}(u_{\YIdX})$ is Gaussian, and the $T_{d}$ are mutually independent and independent of $\epsilon$, then each $T_{d}$ with $\beta_{d} \neq 0$ must itself be Gaussian (Cram\'er's decomposition theorem).

Our next question is which linear transformations of $\{T_{d}\}_{d=1}^D$ still yield a standard Gaussian distribution for $\Phi^{-1}(u_{\YIdX})$. Assume that the $T_{d}$ are independent standard Gaussians, so that the marginal distribution of $\Phi^{-1}(u_{\YIdX})$ is standard Gaussian. Consider the change of variables
\begin{equation*}
    W_{d} = \alpha_{d} T_{d} + \mu_{d}, \quad d \in \{1,\dots,D\},
\end{equation*}
where $\alpha_{d}$ and $\mu_{d}$ are constants. Our goal is to identify conditions on $\{(\alpha_{d}, \mu_{d})\}_{d=1}^{D}$ under which
\begin{equation*}
    \mathbb{E}\left[\sum_{d=1}^{D} \beta_{d} W_{d} + \epsilon \right] = 0 \qquad \text{and} \qquad \mathbf{Var}\left[ \sum_{d=1}^{D} \beta_{d} W_{d} + \epsilon \right] = 1.
\end{equation*}
Starting with the expectation, and using $\mathbb{E}[T_{d}] = 0$ and $\mathbb{E}[\epsilon] = 0$,
\begin{align}
    \mathbb{E}\left[\sum_{d=1}^{D} \beta_{d} W_{d} + \epsilon \right] &= \mathbb{E}\left[\sum_{d=1}^{D} \alpha_{d} \beta_{d} T_{d} + \sum_{d=1}^{D} \beta_{d} \mu_{d} + \epsilon \right] \\
    &= \sum_{d=1}^{D} \beta_{d} \mu_{d}.
\end{align}
Similarly for the variance, using the independence of $T_1, \dots, T_D$ and $\epsilon$ together with $\mathbf{Var}[T_{d}] = 1$,
\begin{align}
    \mathbf{Var}\left[\sum_{d=1}^{D} \beta_{d} W_{d} + \epsilon \right] &= \mathbf{Var}\left[\sum_{d=1}^{D} \alpha_{d} \beta_{d} T_{d} + \sum_{d=1}^{D} \beta_{d} \mu_{d} \right] + \mathbf{Var}[\epsilon] \\
    &= \sum_{d=1}^{D} (\alpha_{d}\beta_{d})^{2} + 1 - \sum_{d=1}^{D}\beta_{d}^{2}.
\end{align}
Hence we sample from exactly the same marginal effect whenever
\begin{equation*}
    W_{d} \sim \mathcal{N}(\mu_{d}, \alpha_{d}^2 ), \quad d \in \{1,\ldots, D\},
\end{equation*}
for any $\{(\mu_{d}, \alpha_{d})\}_{d=1}^{D}$ satisfying
\begin{equation*}
    \sum_{d=1}^{D} \beta_{d}\mu_{d} = 0 \quad \text{and} \quad \sum_{d=1}^{D}(\alpha_{d}\beta_{d})^{2} = \sum_{d=1}^{D} \beta_{d}^{2}.
\end{equation*}
This is a restrictive case. Because these conditions are rarely satisfied, especially in high-dimensional settings where the copula function can become quite complex, it is not a significant concern for our work.
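
As a numerical illustration (a sketch with $D=2$; the coefficients and moments below are hypothetical values chosen for exposition, and the $T_{d}$ are taken to be independent standard Gaussians as in the variance calculation above), let $\beta_1 = 0.6$ and $\beta_2 = 0.4$, so that $\mathbf{Var}[\epsilon] = 1 - 0.52 = 0.48$. Setting the mean expression to zero and the variance expression to one requires $\sum_{d}\beta_{d}\mu_{d} = 0$ and $\sum_{d}(\alpha_{d}\beta_{d})^{2} = \sum_{d}\beta_{d}^{2}$. Choosing $(\mu_1, \alpha_1^2) = (2, 0.5)$ and $(\mu_2, \alpha_2^2) = (-3, 2.125)$ gives
\begin{align*}
    \sum_{d=1}^{2} \beta_{d}\mu_{d} &= 0.6 \cdot 2 + 0.4 \cdot (-3) = 0, \\
    \sum_{d=1}^{2} (\alpha_{d}\beta_{d})^{2} + 1 - \sum_{d=1}^{2}\beta_{d}^{2} &= 0.36 \cdot 0.5 + 0.16 \cdot 2.125 + 0.48 = 0.18 + 0.34 + 0.48 = 1,
\end{align*}
so $W_1 \sim \mathcal{N}(2, 0.5)$ and $W_2 \sim \mathcal{N}(-3, 2.125)$ reproduce a standard Gaussian marginal even though neither variable is itself standard Gaussian.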

\section{MODELS}

We provide details of the models evaluated in our paper.

\paragraph{Engression} Engression, proposed by \cite{shen2023engression}, approximates the conditional distribution $P\left(Y\mid X\right)$ using a pre-additive noise model $Y = g(WX + \eta) + \beta^\top X$, where $g: \mathbb{R}^d \rightarrow \mathbb{R}$ is a non-linear function that captures non-linear relationships and $\eta = h(\epsilon)$ introduces flexible noise. The model is implemented as a neural network and trained by minimising the energy score loss for distributional regression.
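
For reference, the energy score loss can be sketched as follows, under the assumption (made here for illustration, with notation not used elsewhere in this supplement) that $\hat{Y}$ and $\hat{Y}'$ denote two independent draws from the fitted model given the same covariates $X$:
\begin{equation*}
    \mathcal{L} = \mathbb{E}\left[ \lVert Y - \hat{Y} \rVert - \tfrac{1}{2} \lVert \hat{Y} - \hat{Y}' \rVert \right].
\end{equation*}
The first term rewards draws that are close to the observed outcome, while the second rewards spread between draws, so the loss is minimised in expectation when the fitted conditional distribution matches the true one.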

\paragraph{Meta-learners}
Meta-learners are flexible frameworks in causal inference designed to estimate individualized treatment effects by leveraging machine learning models. Two common types are T-learners and S-learners. Details can be found in \cite{kunzel2019metalearners}.

T-learners work by training separate models for the treated and untreated groups, predicting outcomes under each treatment condition, and then calculating the difference between these predictions to estimate the treatment effect.
S-learners combine both treated and untreated data into a single model by including treatment as an input feature, allowing the model to learn the outcome function across both treatment conditions simultaneously.
These learners provide a modular approach to estimating Conditional Average Treatment Effects (CATE) and can adapt to different settings and model complexities.
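
In symbols (a sketch using notation assumed here for illustration: $Z$ for covariates, $X \in \{0,1\}$ for the treatment, and $\hat{\mu}$ for a fitted regression function), the two estimators of the CATE $\tau(z) = \mathbb{E}[Y(1) - Y(0) \mid Z = z]$ can be written as
\begin{align*}
    \hat{\tau}_{T}(z) &= \hat{\mu}_{1}(z) - \hat{\mu}_{0}(z), \\
    \hat{\tau}_{S}(z) &= \hat{\mu}(z, 1) - \hat{\mu}(z, 0),
\end{align*}
where $\hat{\mu}_{x}$ is fitted only on units with $X = x$, and $\hat{\mu}$ is fitted on all units with the treatment as an additional input.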
\paragraph{CausalForest}

CausalForest is an extension of random forests designed to estimate heterogeneous treatment effects by partitioning the data into subgroups with similar treatment responses. Introduced by \cite{wager2018estimation}, CausalForest uses a tree-based ensemble method to non-parametrically estimate Conditional Average Treatment Effects (CATE) by building separate models for different covariate regions, while ensuring a balance between treated and control units in each partition. This method is flexible and adapts to complex data structures, making it a powerful tool for understanding treatment effect heterogeneity.

\paragraph{BART} BART (Bayesian Additive Regression Trees), first introduced in  \cite{chipman2010bart}, is a non-parametric machine learning method that uses an ensemble of regression trees to model complex relationships between covariates and outcomes.  The BART model estimates the posterior distribution of the outcome by summing the contributions from many trees, each of which is trained to explain part of the residual error left by the others. This ensemble approach makes BART particularly effective at capturing complex, non-linear relationships between the covariates and the outcome. Unlike standard decision trees, BART applies a Bayesian framework, allowing it to quantify uncertainty in its predictions and avoid overfitting through regularization priors.

\paragraph{TARNet} TARNet (Treatment-Agnostic Representation Network), first introduced in \cite{johansson2016learning}, is a neural network-based model for estimating heterogeneous treatment effects in causal inference. It works by learning a shared representation of covariates, independent of treatment assignment, and then using this representation to estimate potential outcomes for both the treated and untreated groups. By focusing on treatment-agnostic representation learning, TARNet aims to improve the generalizability and accuracy of treatment effect estimates, particularly in high-dimensional settings.

\section{COMPUTATION DETAILS}

We provide computational details for the experiments in the main text. We use the default recommended hyperparameters for each model; the key hyperparameters are listed in \Cref{tab:hyperparameter}.

\begin{table}[h]
\caption{Hyperparameters of Each Model} 
\label{tab:hyperparameter}
\begin{center}
\begin{tabular}{|l|p{8cm}|p{5cm}|}
\hline
\textbf{Model} & \textbf{Key Hyperparameters} & \textbf{Package} \\
\hline
TARNet & Number of layers = 2, batch size = 64, learning rate = 0.0001, number of epochs = 2000 & Python, \texttt{catenets} \citep{curth2021really} \\
\hline
CausalForest & Number of trees = 100, maximum depth = 3 & Python, \texttt{econml} \citep{econml} \\
\hline
S-/T-BART & Number of trees = 75, number of iterations = 4, number of burn-in iterations = 200, posterior draws = 800 & R, \texttt{dbarts} \citep{dbarts} \\
\hline
S-/T-engression & Number of layers = 3, batch size = 64, learning rate = 0.01, number of epochs = 500 & Python, \texttt{engression} \citep{engression} \\
\hline
\end{tabular}
\end{center}
\end{table}

All experiments were conducted on a MacBook with an Apple M3 chip, 8-core CPU, and 32GB RAM. The code can be found in \texttt{TestGeneralizability.zip}.


\section{ADDITIONAL EXPERIMENTS}
This section reports an additional experiment based on the synthetic data setting in the main text, first without domain shift. We set the marginal distribution of $Z_1$, $Z_2$ to be $\mathcal{N}(1,1)$, and $Y(X) \sim \mathcal{N}(2X+1,1)$, $X\sim \operatorname{Bernoulli} (0.5)$. In this case, the CATE should be linear, as noted in \Cref{sec:gaussian}.

Results without domain shift are shown in \Cref{fig:synthetic_mean_p_noshift}. The $p$-values of both S-LinearRegression and T-LinearRegression are approximately uniformly distributed, as expected when the fitted model is correctly specified. Since the true CATE function is indeed linear, this result supports the validity of our proposed method.


\begin{figure}[h!]
\vspace{.3in}
\centerline{\includegraphics[width=0.5\linewidth]{synthetic_mean_p_noshift.png}}
\vspace{.3in}
\caption{$p$-values of Mean Regression Testing, Synthetic Data of 50 Iterations, No Domain Shift.}
\label{fig:synthetic_mean_p_noshift}
\end{figure}

We next test the case with domain shift: we keep all settings the same as above for the training set, but change the marginal distribution of $Z_1$, $Z_2$ in the test set to $\mathcal{N}(3,2)$. \Cref{fig:synthetic_mean_p_shift} shows the results. The linear regressions still generalize well. However, the generalizability of algorithms such as S-engression and S-BART worsens, likely due to problems such as overfitting.
\begin{figure}[h!]
\vspace{.3in}
\centerline{\includegraphics[width=0.5\linewidth]{synthetic_mean_p_shift_linear.png}}
\vspace{.3in}
\caption{$p$-values of Mean Regression Testing, Synthetic Data of 50 Iterations, with Domain Shift.}
\label{fig:synthetic_mean_p_shift}
\end{figure}

\newpage
\bibliography{references}
\vfill

\end{document}
