\documentclass[twoside]{article}

% algorithm
\usepackage{algorithm}
% \usepackage[noend]{algpseudocode}
\usepackage[noend]{algorithmic}
% \usepackage{algorithmic}
\usepackage{amsfonts}
\usepackage{bbold}
\usepackage{graphicx}
\usepackage{amsthm,amsmath,bm}
\usepackage{cleveref}

\newtheorem{theorem}{Theorem}
\newtheorem{definition}{Definition}
\newtheorem{lemma}[theorem]{Lemma}

\usepackage{soul}

\newcommand{\RR}{I\!\!R} %real numbers
\newcommand{\Nat}{I\!\!N} %natural numbers
\newcommand{\CC}{I\!\!\!\!C} %complex numbers

\newcommand{\YZIX}{{Y\hspace{-1pt}Z \hspace{-.5pt} | \hspace{-.5pt}X}}
\newcommand{\ZYIX}{{Z\hspace{-1pt}Y \hspace{-.5pt} | \hspace{-.5pt}X}}
\newcommand{\ZbYIX}{{\bm{Z}\hspace{-.25pt}Y \hspace{-.5pt} | \hspace{-.5pt}X}}
\newcommand{\ZYIdX}{{Z\hspace{-1pt}Y \hspace{-.5pt} | \hspace{-.5pt}\text{do}(X)}}
\newcommand{\ZbYIdX}{{\bm{Z}\hspace{-.5pt}Y \hspace{-.5pt} | \hspace{-.5pt}\text{do}(X)}}
\newcommand{\ZbYx}{{\bm{Z}\hspace{-.5pt}Y(x) \hspace{-.5pt}}}
\newcommand{\YIZbdX}{{\hspace{-.5pt}Y \hspace{-.5pt} | \hspace{-.5pt} \bm{Z}\hspace{-1pt}, \hspace{-0.5pt} \text{do}(X)}}
\newcommand{\YxIZb}{{\hspace{-.5pt}Y(x) \hspace{-.5pt} | \hspace{-.5pt} \bm{Z}}}
\newcommand{\YIZdX}{{\hspace{-.5pt}Y \hspace{-.5pt} | \hspace{-.5pt} Z\hspace{-1pt}, \hspace{-0.5pt} \text{do}(X)}}
\newcommand{\YZIXC}{{Y\hspace{-1pt}Z \hspace{-.5pt} | \hspace{-.5pt}X\hspace{-.5pt}C}}
\newcommand{\YcolonZdX}{{Y\hspace{-1pt};\bm{Z} \hspace{-.5pt} | \hspace{-.5pt}\text{do}(X)}}
\newcommand{\YIXZ}{{Y \hspace{-.5pt} | \hspace{-.5pt}X \hspace{-1pt}Z}}
\newcommand{\YIdX}{{Y \hspace{-.5pt} | \hspace{-.5pt}\text{do}(X) \hspace{-1pt}Z}}
\newcommand{\YIZX}{{Y \hspace{-.5pt} | \hspace{-.5pt}Z \hspace{-1pt}X}}
\newcommand{\YIZbX}{{Y \hspace{-.5pt} | \hspace{-.5pt}\bm{Z} \hspace{-.5pt}X}}
\newcommand{\ZXY}{{\hspace{-.5pt}Z \hspace{-1pt} X \hspace{-.75pt} Y}}
\newcommand{\ZbXY}{{\hspace{-.5pt}\bm{Z} \hspace{-1pt} X \hspace{-.75pt} Y}}
\newcommand{\XY}{{X \hspace{-.75pt} Y}}
\newcommand{\ZX}{{\hspace{-.5pt} Z \hspace{-1pt} X}}
\newcommand{\YIZ}{Y \hspace{-.5pt} | \hspace{-.75pt} Z}
\newcommand{\YZ}{Y \hspace{-1pt} Z}
\newcommand{\YIX}{{Y\hspace{-.5pt}|\hspace{-.5pt} X}}
\newcommand{\ZIX}{{Z \hspace{-.5pt}|\hspace{-.5pt} X}}
\newcommand{\ZbIX}{{\bm{Z} \hspace{-.5pt}|\hspace{-.5pt} X}}
\newcommand{\XIZ}{{\hspace{-.75pt} X \hspace{-.5pt}|\hspace{-.5pt} Z}}
\newcommand{\XIZb}{{\hspace{-.75pt} X \hspace{-.5pt}|\hspace{-.5pt} \bm{Z}}}
\newcommand{\indep}{\rotatebox[origin=c]{90}{$\models$}}

\newcommand{\cmid}{\,|\,}

% \usepackage{aistats2025}
% If your paper is accepted, change the options for the package
% aistats2025 as follows:
%
\usepackage[accepted]{aistats2025}
%
% This option will print headings for the title of your paper and
% headings for the authors names, plus a copyright note at the end of
% the first column of the first page.

% If you set papersize explicitly, activate the following three lines:
%\special{papersize = 8.5in, 11in}
%\setlength{\pdfpageheight}{11in}
%\setlength{\pdfpagewidth}{8.5in}


% If you use natbib package, activate the following three lines:
\usepackage[round]{natbib}
\renewcommand{\bibname}{References}
\renewcommand{\bibsection}{\subsubsection*{\bibname}}

% If you use BibTeX in apalike style, activate the following line:
\bibliographystyle{apalike}

\begin{document}

% If your paper is accepted and the title of your paper is very long,
% the style will print as headings an error message. Use the following
% command to supply a shorter title of your paper so that it can be
% used as headings.
%
%\runningtitle{I use this title instead because the last one was very long}

% If your paper is accepted and the number of authors is large, the
% style will print as headings an error message. Use the following
% command to supply a shorter version of the authors names so that
% they can be used as headings (for example, use only the surnames)
%
%\runningauthor{Surname 1, Surname 2, Surname 3, ...., Surname n}

\twocolumn[

\aistatstitle{Testing Generalizability in Causal Inference}

\aistatsauthor{Daniel de Vassimon Manela\textsuperscript{1,*} \And Linying Yang\textsuperscript{1,*} \And Robin J. Evans\textsuperscript{1}}
\aistatsaddress{\textsuperscript{1}Department of Statistics, University of Oxford \\
% \texttt{\{manela,linying.yang,evans\}@stats.ox.ac.uk} \\
\textsuperscript{*}Equal Contribution} ]
% \aistatsaddress{\textsuperscript{1}Department of Statistics, University of Oxford \\
% \textsuperscript{*}Equal Contribution \\
% \texttt{manela@stats.ox.ac.uk}, \texttt{linying.yang@stats.ox.ac.uk}, \texttt{evans@stats.ox.ac.uk}}

\begin{abstract}
Ensuring robust model performance in diverse real-world scenarios requires addressing generalizability across domains with covariate shifts. However, no formal procedure exists for statistically evaluating generalizability in machine learning algorithms. Existing methods often rely on arbitrary proxy predictive metrics like mean squared error, but do not directly answer whether a model can or cannot generalize. To address this gap in the domain of causal inference, we propose a systematic framework for statistically evaluating the generalizability of high-dimensional causal inference models. Our approach uses the frugal parameterization to allow for flexible simulations from fully and semi-synthetic causal benchmarks, offering a comprehensive evaluation for both mean and distributional regression methods. Because it is grounded in real-world data, our method yields more realistic evaluations, which are often missing from current work that relies on simplified datasets. Furthermore, by combining simulations with statistical testing, our framework is robust and avoids over-reliance on conventional metrics, providing statistical safeguards for decision making.
\end{abstract}

\section{INTRODUCTION}

Model generalizability has garnered significant interest in causal inference \citep{bareinboim2016causal,curth2021really, johansson2018learning, buchanan2018generalizing,ling2022critical,bica2022transfer}. This encompasses transportability under covariate shifts between domains and extrapolation. In causal inference, it specifically refers to the ability of a causal model to make accurate predictions or draw valid conclusions when applied to a domain different from the one it was trained on. This concept is crucial when the objective involves understanding and predicting the effects of interventions across various settings. It holds particular importance in clinical contexts, where the growing interest in personalized treatment and patient stratification underscores the need to generalize inferences across diverse populations.

Current approaches for evaluating model generalizability generally involve using predictive metrics like AUC for classification or MSE for regression \citep{zhou2022domain,yu2024survey}. However, these metrics do not directly answer the question of interest, that is, \emph{whether a model can or cannot generalize}. Does an MSE of 5 on another domain imply that the model does not generalize? How about an MSE of 1? Are these results and interpretations reproducible with statistical guarantees? How much does random noise affect these metrics? These are critical problems that should be carefully considered in causal inference questions involving multiple domains. It is essential to establish a systematic evaluation framework for generalizability, one that offers a robust, direct, and reproducible evaluation of model performance on relevant tasks.

One approach to this problem is statistical testing, where we set the question of interest as the hypothesis we test against. However, it is difficult to obtain power against a wide range of alternative hypotheses when performing tests conditional on a high-dimensional covariate set. This poses a problem for causal practitioners as they are often interested in modeling quantities such as the individual treatment effect.

% For generalizability in causal inference, the critical task is to determine, with controlled probability of making errors due to random noise, whether models can generalize causal insights across (task-specific) populations.
    
\paragraph{Main Contributions} We propose a systematic framework for statistically evaluating the generalizability of high-dimensional causal inference algorithms by targeting low-dimensional causal margins. Rather than relying on arbitrary metrics, we provide a testing framework that statistically evaluates the transportability of both mean and distributional regression methods.

% Our method includes a semi-synthetic simulation framework using two domains---training (A) and testing (B)---that share the same intervened conditional outcome distributions, but potentially differ in their covariate and treatment distributions.  A model is trained on domain A to \textbf{learn the shared high dimensional conditional outcome distribution}. We test the model's generalizability by estimating the marginal causal quantities in domain B, where these values are \emph{explicitly known}. This is made possible through the use of the frugal parameterization \citep{evans2024parameterizing}. Our approach simplifies the evaluation process by reducing the complexity from higher-dimensional intervened models to a lower-dimensional causal effect, enabling more powerful statistical testing.

Our method includes a semi-synthetic simulation framework using two domains, training ($A$) and testing ($B$), which have different covariate ($\bm{Z}$) and treatment ($X$) distributions, but whose \emph{conditional outcome distribution} (COD, $Y(x) \mid \bm{Z}$) is assumed to be the same. First, we fit a frugally parameterized model \citep{evans2024parameterizing} to learn the COD $P_{Y(x)|\bm{Z}}$ on domain $B$. The frugal parameterization allows us to obtain the \emph{marginal outcome distribution} (MOD) of $Y(x)$ on domain B % marginal causal density within the full joint distribution 
explicitly as part of the joint distribution.
% Next, we specify covariates and treatment distributions in domain A to be different from domain B. 
We then generate semi-synthetic outcome samples of domain A by applying the COD of domain $B$, while using the covariates and treatments from domain $A$. 

Next, we train the causal model of interest on these semi-synthetic samples in domain $A$, and use it to estimate marginal causal quantities for domain $B$. The model's generalizability is assessed by statistically testing its ability to recover marginal causal quantities from $B$ against the \emph{explicitly known} ground truth inferred earlier. By reducing the complexity from higher-dimensional intervened models to a lower-dimensional causal effect, our approach simplifies the evaluation process, enabling more powerful statistical testing.


% With our method, we are able to derive the explicit, known values of these marginal quantities. We thus assess the generalizability of the trained model by constructing estimations and true marginal quantities. This approach simplifies the evaluation from a higher-dimensional intervened model to lower-dimensional marginal quantities, facilitating statistical testing.

% \paragraph{DAN's DRAFT EDITS} 
% A high-level overview of the workflow is as follows. We first define two domains $A$ and $B$ to be our training and test domains, respectively. A frugal model is fit to data from $B$ to learn the Conditional Outcome Distribution (COD), $p^{B}_{Y(x)|\bm{Z}}$. The choice of a frugal model allows us to explicitly parameterize the marginal causal density in domain $B$. Next, we generate semi-synthetic outcomes for domain $A$ corresponding to the COD defined in $B$: $\hat{y}_{i}^{A} \sim p^{B}_{Y(x)|\bm{Z}}(\bm{z}_{i}^{A},~x_{i}^{A})$. A model $\hat{f}$ is then fit on the semi-synthetic data from domain $A$, $\{(\bm{z}^{A},x^{A},\hat{y}^{B})\}_{i}$, and then sued to estimate outcomes on data from $B$, $\hat{y}^{B}_{i} = \hat{f}(\bm{z}^{B}_{i}, x^{B}_{i})$. These are then used to estimate marginal quantities, which are then statistically tested against the underlying truth modeled by $p^{B}_{Y(x)|\bm{Z}}$.

% A high-level overview of the workflow is as follows:
% \begin{enumerate}
%     \item \textbf{Learn both the distribution parameters of two domains, and the Conditional Outcome Distribution (COD) from real-world data}: Define two domains, domain A and domain B, of which the covariate and treatment distributions can be different, but the COD is the same. These distributions can be learned empirically from real-world data, rather than just being limited to specifying parametric models.
    
%     \item \textbf{Model training}: Simulate semi-synthetic data from domain A using the distributions fitted on data in step 1. Train a conditional effect model on the simulated data.
    
%     \item \textbf{Prediction/Estimaton}: Simulate data from domain B. Apply the trained model on the sampled covariates and treatments from domain B and estimate marginal causal quantities outcome predictions from the model.
    
%     \item \textbf{Evaluate generalizability with statistical testing}: Statistically test whether the sampled outcomes deviate significantly from the known ground truth in domain B. This provides an evaluation of the model's generalizability under covariate and treatment distribution shifts. The tests assess whether the model generalizes effectively by focusing on lower-dimensional quantities instead of high-dimensional conditional outcome models.
% \end{enumerate}
The proposed method builds on the availability of marginal causal quantities in domain $B$. In many real applications, it is precisely these marginal quantities that are reported. For example, in many studies analyzing COVID-19 outcomes, researchers reported untreated outcomes, such as mortality rates or symptom progression, to contextualize treatment effects. The untreated mortality rate for severe COVID-19 in \cite{recovery2021dexamethasone} is often cited as a benchmark for evaluating interventions like dexamethasone. Our method thus provides a simple and effective solution for assessing the generalizability of an algorithm on complicated (real-world) data with statistical guarantees, including Type-I error control.
% A high-level overview of the workflow of our method is as follows. (1) We first define two domains, domain A (training) and domain B (testing), of which the covariate and treatment distributions can be different, but the COD is the same. These distributions can be learned empirically from real-world data, rather than just limited to specifying parametric models. (2) Next, we simuluate semi-synthetic data from domain A using pre-specified distributions. Train a conditional effect model on the simulated data. (3) We then simulate data from domain B, whose covariate and treatment distributions may differ from domain A, but with an identical COD. Apply the trained model on the sampled covariates and treatments from domain B and estimate marginal causal quantities outcome predictions from the model. (4) Finally, we statistically test whether the estimated marginal causal quantities deviate significantly from the known ground truth in domain B. This provides an evaluation of the model's generalizability under covariate and treatment distribution shifts. The tests assess whether the model generalizes effectively by focusing on lower-dimensional quantities (marginal causal distributions) instead of high-dimensional conditional outcome models.

% \paragraph{Main Contributions} In this work, we propose a formal framework for statistically testing the generalizability of machine learning algorithms under covariate and treatment distribution shifts, specifically in the context of causal inference. Rather than simply relying on predictive metric scores, we provide tests that statistically evaluate the ability of both mean and distributional regression methods regarding generalizability. 
% % Our approach is built on \textbf{frugal parameterization}\cite{evans2024parameterizing}, enabling simulations from various data-generating processes as well as real-world data.  
% % In real applications, generalizability is particularly dependent on key properties from real data. For instance, sample size may affect performances of algorithms like neural networks to play the balance between in-sample performance and generalizability. Complex data structures can also play a crucial rule. This is why our simulation-based method is so important: it offers a comprehensive approach to evaluate model generalizability across diverse scenarios, providing a simple and effective solution to account for these complexities in real-world applications. 
% % In real-world applications, generalizability depends on factors such as sample sizes and the complexity of the data structures. Our proposed simulation-based method offers a comprehensive framework for quantitatively evaluating model generalizability across diverse scenarios. 
% This provides a simple and effective solution for assessing how well algorithms account for these complexities in real-world applications. 

% Consequently, we claim that our evaluation method is:
% \begin{itemize}
%     \item \textbf{\textit{Systematic}}: We offer a structured framework that allows users to easily specify and input flexible data generation processes for simulations from various data generation processes.
%     \item \textbf{\textit{Robust}}: We incorporate statistical testing to evaluate the generalizability of distributional and mean regression models, evaluating model generalizability by directly and providing statistical safeguard for decision making, which proxy predictive measures like MSE fail to do.
%     \item \textbf{\textit{Realistic}}: Simulations can be based on actual data, bridging the gap between synthetic evaluations and real-world applications.
% \end{itemize}
% Consequently, we claim that our evaluation method is \textbf{\textit{systematic}} - we offer a structured framework that allows users to easily specify and input flexible data generation processes for simulations, \textbf{\textit{comprehensive}} - our method supports simulations from various data generation processes, covering both continuous and discrete covariates and outcomes,
% % , with distributions like Gamma, Exponential, Gaussian, etc), 
% \textbf{\textit{robust}} - we incorporate statistical testing to evaluate the generalizability of distributional and mean regression models, and \textbf{\textit{realistic}} - simulations can be based on actual data, bridging the gap between synthetic evaluations and real-world applications.

\section{BACKGROUND}

Throughout the paper, we consider a static treatment model with an outcome $Y \in \mathcal{Y}\subseteq \mathbb{R}$ and a general treatment $X$ which can be either continuous or discrete. Let the $D$ measured pretreatment covariates be $\bm{Z} \in \mathcal{Z}\subseteq \mathbb{R}^{D}$.
Under the standard causal assumptions of SUTVA, positivity, and conditional ignorability outlined in \citet{pearl2009causality}, we can write the density of the marginal \textit{causal} distribution of the outcome as
% 
\begin{equation}
    p_{Y(x)}(y) = \int p_{\YIZbX}(y \cmid \bm{z}, x) ~ p_{\bm{Z}}(\bm{z})~d\bm{z};
\end{equation} 
this is marginalized over the covariates.  Here $Y(x)$ is the \emph{potential outcome} for $Y$ given that $X$ is set to a value $x$.
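
As a purely illustrative special case of this marginalization (the linear-Gaussian model and the symbols $\alpha$, $\beta$, $s$, $m$, $v$ are assumptions introduced here for exposition only), take a single covariate with $Z \sim \mathcal{N}(m, v^{2})$ and $Y \mid Z=z, X=x \sim \mathcal{N}(\alpha x + \beta z, s^{2})$. Writing $\mathcal{N}(\,\cdot\,; a, b^{2})$ for the Gaussian density with mean $a$ and variance $b^{2}$, we obtain
\begin{equation*}
    \begin{aligned}
        p_{Y(x)}(y) &= \int \mathcal{N}(y; \alpha x + \beta z, s^{2}) ~ \mathcal{N}(z; m, v^{2})~dz \\
        &= \mathcal{N}\!\left(y; \alpha x + \beta m,~ s^{2} + \beta^{2} v^{2}\right).
    \end{aligned}
\end{equation*}
In this sketch the marginal causal distribution depends on the covariate distribution through $m$ and $v$, even though the conditional law of the outcome given $(Z, X)$ does not.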

% We distinguish between the marginal \textit{conditional} treatment density which is the marginalization over the observational dataset:
% \begin{equation}
%     p_{\YIX} = \int p_{\YxIZb} ~ p_{\ZbIX}~d\bm{z},
% \end{equation}
% and the marginal \textit{causal} treatment density:
% \begin{equation}
%     p_{Y(x)} = \int p_{\YxIZb} ~ p_{\bm{Z}}~d\bm{z}.
% \end{equation} which is the marginal from the randomized model.

We also use $\mu(x) = \mathbb{E}\, Y(x)$ to denote the expected outcome given an intervention that sets $\{X=x\}$, and $\mu(x,\bm{z}) = \mathbb{E}\left[Y(x)\mid \bm{Z}=\bm{z}\right]$ to denote the conditional expectation given covariate values. Note that $Y(x)$ is essentially equivalent to $Y \mid \text{do}(X=x)$ in the notation of \citet{pearl2009causality}. When the treatment is binary, we define $\tau = \mathbb{E}[Y(1) -Y(0)]$ as the average treatment effect (ATE), quantifying the overall impact of a treatment change across the entire population. Similarly, let $\tau(\bm{z}) = \mathbb{E}[Y(1) -Y(0)\mid \bm{Z}=\bm{z}]$ be the conditional average treatment effect (CATE), giving the effect for specific subgroups or individuals, and therefore capturing treatment effect heterogeneity.

Denote the probability measures in domain $A$ and domain $B$ by $P^A$ and $P^B$, respectively. Since our scenario requires that the conditional outcome distributions are the same, we have $P^A_{Y(x)\mid\bm{Z}}=P^B_{Y(x)\mid\bm{Z}}$; however, since the covariate and treatment distributions may differ, the corresponding equality between the \emph{marginal} causal distributions does not necessarily hold.
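
To see why the marginal quantities can differ, consider a hypothetical example with a binary treatment, a single covariate, and a shared conditional mean of the form $\mu(x,z) = \alpha x + \beta z + \gamma x z$ (the linear form and the coefficients $\alpha$, $\beta$, $\gamma$ are assumptions made only for illustration). The CATE is then $\tau(z) = \alpha + \gamma z$ in both domains, whereas averaging over each domain's covariate distribution gives
\begin{equation*}
    \begin{aligned}
        \tau^{A} &= \alpha + \gamma\, \mathbb{E}^{A}[Z], \qquad \tau^{B} = \alpha + \gamma\, \mathbb{E}^{B}[Z], \\
        \tau^{A} - \tau^{B} &= \gamma \left( \mathbb{E}^{A}[Z] - \mathbb{E}^{B}[Z] \right),
    \end{aligned}
\end{equation*}
where $\mathbb{E}^{A}$ and $\mathbb{E}^{B}$ denote expectations under $P^{A}$ and $P^{B}$. In this sketch the two ATEs coincide only if there is no effect modification ($\gamma = 0$) or the covariate means agree.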

We aim to evaluate the generalizability of an outcome regression model $\hat{f}(\bm{z},x)$ that predicts the expected outcome $Y$. Predicted outcomes are denoted by $\hat{y}_i :=\hat{f}(\bm{z}_i,x_i)$.

\subsection{Generalizability in Causal Inference}

Extensive research has focused on generalizability in causal inference, as mentioned in the introduction. 
% Recently, combining Randomized Controlled Trials (RCT) data with observational data has shown promise for improving CATE estimations in real-world settings. Calibrating outcome models with observational data helps models trained on RCTs better generalize to diverse populations \citep{curth2021really}.
As highlighted by \cite{ling2022critical}, three common approaches are used to assess treatment effect generalizability: inverse probability of sampling weighting (IPSW) methods that adjust for differences between study and target populations by weighting based on sample inclusion probabilities \citep{buchanan2018generalizing}; outcome models that estimate the conditional outcome directly \citep{kern2016assessing}; and hybrid approaches that combine both \citep{dahabreh2019generalizing}.

In this work, we focus on algorithms that generalize conditional outcome predictions across different domains, enabling accurate CATE or COD estimation. This is crucial for understanding individual-level treatment effect heterogeneity and ensuring models can adapt to new populations or environments with varying covariate distributions. A summary of common CATE estimation methods is provided by \cite{caron2022estimating}.

% Despite advancements in CATE estimation, a systematic framework for evaluating generalizability is still underdeveloped. Commonly current methods, like MSE and Precision in Estimation of Heterogeneous Effect (PEHE), provide limited real-world insights \citep{curth2021really,kiriakidou2022evaluation}. 

% \paragraph{DAN'S EDITS} 
Despite advancements in CATE estimation, a systematic framework for evaluating generalizability remains underdeveloped. For example, \citet{johansson2018learning} validate their model using both simulated and real-world data. The simulated examples assess predictive generalizability with MSE in the absence of any treatment mechanism, making causal verification impossible. Additionally, their analysis of the IHDP dataset \citep{hill2011bayesian} does not involve covariate or treatment shifts, so it does not effectively test generalizability. Another relevant paper is \citet{shi2021invariant}, which measures out-of-domain generalization performance using the mean absolute error (MAE). While their method achieves the lowest MAE among competitors, there is no formal criterion to determine whether a specific MAE value signifies sufficient generalization to a new domain.

We highlight these issues not as criticisms of the papers, but to emphasize that robust methods for evaluating the generalizability of causal models are lacking and challenging to develop. Furthermore, existing benchmarks like IHDP are not specifically designed for out-of-domain generalization tests. To address this gap, we propose a systematic semi-synthetic framework to evaluate how well CATE algorithms perform across domains with different covariate distributions, offering a more practical assessment of whether a given approach will generalize well. In \Cref{sec:IHDP}, we adapt the IHDP experiments presented in \citet{johansson2018learning} and extend them by generating datasets from different domains. Furthermore, we contrast the predictive MSE scores with the $p$-values derived from our tests to show how the latter provide a more actionable indication of whether a model successfully generalizes.
% We demonstrate its implementation on the IHDP data in \Cref{sec:IHDP}.

\subsection{Frugal Parameterization}\label{subsec:frugal-params}
A frugal parameterization of an observational joint distribution, $P_{\ZbXY}$, factorizes the distribution into a set of causally relevant components~\citep{evans2024parameterizing}. This decomposition explicitly parameterizes the marginal causal distribution, $P_{Y(x)}$, or another lower-dimensional causal distribution $P_{Y(x)|\bm{W}}$ with $\bm{W}\subset \bm{Z}$, and builds the rest of the model around it.

Let us start by first parameterizing the \textit{conditional outcome distribution} (COD), $P_{\YxIZb}$. Frugal models can parameterize the COD in terms of the marginal causal distribution, $P_{Y(x)}$, and a conditional copula distribution, $C_{\YxIZb}$. Here, $C_{\YxIZb}$ models the joint dependency between the marginal causal distribution and each of the univariate marginal covariate distributions, $\{P_{Z_i}\}_{i}$ such that:
\begin{equation}
    p_{\YxIZb} = p_{Y(x)} \cdot  c_{\YxIZb},
\end{equation}
where $c_{\YxIZb}$ is  a copula density function (see Supplementary Material for further details on copulas) that parameterizes the dependence between $Y(x)$ and the covariates.
% \begin{equation}
%     C_{\YxIZb} := C\left(F_{Y(x)} \mid F_{Z_1},\dots,~F_{Z_{D}} \right).
% \end{equation}
% We present a summary of copulas in the appendix for unfamiliar readers, but in short, copulas provide a framework for encoding dependencies between marginal quantities in such a way that the marginal distributions are preserved.
This leaves the distribution of the \textit{past}, %$P_{\bm{Z}X}$, 
i.e.~the covariate distribution and the propensity score. We assume that all covariates are strictly pretreatment, so $\bm{Z}$ does not include any mediators of the causal effect of $X$ on $Y$. If we use a conditional copula then the past and the COD are variation independent, in the sense that they parameterize separate, non-overlapping aspects of the joint distribution~\citep{evans2024parameterizing}. This allows the past to be freely specified without affecting either the conditional copula, or the marginal causal distribution. 
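
To make this factorization concrete, consider the following sketch, in which the Gaussian choices and the symbols $\sigma$, $\rho$, $m$, $s$ are assumptions introduced for illustration only. Let $D=1$, $Z \sim \mathcal{N}(m, s^{2})$, $Y(x) \sim \mathcal{N}(\mu(x), \sigma^{2})$, and let the copula be Gaussian with correlation $\rho \in (-1,1)$, whose density is
\begin{equation*}
    c(u, v) = \frac{1}{\sqrt{1-\rho^{2}}} \exp\!\left\{ -\frac{\rho^{2}(a^{2} + b^{2}) - 2\rho a b}{2(1-\rho^{2})} \right\},
\end{equation*}
with $a = \Phi^{-1}(u)$, $b = \Phi^{-1}(v)$ and $\Phi$ the standard normal CDF. Setting $u = F_{Y(x)}(y)$ and $v = F_{Z}(z)$ gives $a = (y - \mu(x))/\sigma$ and $b = (z - m)/s$, and multiplying by $p_{Y(x)}(y)$ and completing the square yields
\begin{equation*}
    Y(x) \mid Z = z ~\sim~ \mathcal{N}\!\left( \mu(x) + \rho\sigma\,\frac{z - m}{s},~ \sigma^{2}(1-\rho^{2}) \right).
\end{equation*}
In this example $\mu(x)$ and $\sigma$ describe the marginal causal distribution, $\rho$ describes the dependence on the covariate, and $(m, s)$ belong to the past.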

The frugal parameterization also allows us to choose a conditional estimand. For example, if we were interested in a conditional average treatment effect given $\bm{W} \subset \bm{Z}$, we could write $p_{Y(x)|\bm{Z}} = p_{Y(x)|\bm{W}} \cdot c_{Y(x)|\overline{\bm{Z}}; \bm{W}}$ where $\overline{\bm{Z}} = \bm{Z}\setminus\bm{W}$. Here $c_{Y(x)|\overline{\bm{Z}}; \bm{W}}$ is a pair-copula between $Y(x)$ and $\overline{\bm{Z}}$ conditional upon $\bm{W}$. This enables us to condition on a small subset of covariates that we consider to be particularly important in terms of predicting the outcome.


\section{METHOD}
\Cref{fig:algo-workflow} provides an overview of our workflow. We begin by defining both a test and a training domain, each with a distribution over the pretreatment covariates and the treatment, allowing for distribution shifts across covariates and treatment allocation. The COD is frugally parameterized with a conditional copula, where the covariates' cumulative distribution functions (CDFs) are derived from the test domain's covariate densities. This ensures that samples from the test dataset follow a \textbf{known, customizable} marginal causal density, $p_{Y(x)}$.

The training data is generated from the same COD. However, since the training covariate densities may not match the CDFs used to parameterize the conditional copula, we do not in general have access to the marginal causal distribution of the training domain in closed form. We then learn a model, $\hat{f}(\bm{z},x)$, on the training data. Finally, a statistical test is performed to validate whether the lower-dimensional marginal quantity (such as the ATE or an expected potential outcome) estimated using model outcomes equals the ground truth in the test domain.
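
As a rough sketch of this final step for a mean-regression model (the plug-in form below and the symbols $n_{B}$, $\hat{\mu}^{B}$ are illustrative assumptions, not necessarily the exact estimator used in our experiments), the expected potential outcome in the test domain could be estimated by averaging model predictions over the test covariates with the treatment fixed at $x$:
\begin{equation*}
    \hat{\mu}^{B}(x) = \frac{1}{n_{B}} \sum_{i=1}^{n_{B}} \hat{f}\!\left(\bm{z}^{B}_{i}, x\right).
\end{equation*}
Under this assumed form, the null hypothesis of generalization can be stated as $H_{0}: \mathbb{E}^{B}\big[\hat{f}(\bm{Z}, x)\big] = \mu^{B}(x)$, where $\mu^{B}(x) = \mathbb{E}^{B}\, Y(x)$ is known from the specified marginal causal density.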


\begin{figure}[h]
\vspace{.3in}
\centerline{\includegraphics[width=1\linewidth]{algorithm.png}}
\vspace{.3in}
\caption{Workflow of the Proposed Method.}
\label{fig:algo-workflow}
\end{figure}

\subsection{Data Simulation}
In this section we describe how to simulate the data.

% This involves first parameterizing terms in the test domain,  defining the marginal causal density and the conditional copula density. We define an invariant across both domains in terms of the marginal causal density and the conditional univariate copula density. Everything else can change.

\subsubsection{Multi-domain Simulation with Frugal Models}
We begin by specifying two data generating processes: the training data, $D^{A} \sim P^{A}_{\bm{Z}XY}$, and the test data, $D^{B} \sim P^{B}_{\bm{Z}XY}$. Our goal is to construct a COD that parameterizes the joint density across both domains, while ensuring that the marginal causal density in domain $B$ is parameterized by $p_{Y(x)}$. 

Recall from \Cref{subsec:frugal-params} that a general observational density can be factorized into the \textit{past}, $p_{\bm{Z}X}$, and the COD, $p_{\YxIZb}$, with the latter given by
\begin{equation}\label{eq:cod}
    \begin{aligned}
        p&_{\YxIZb}(y \cmid \bm{z}) = p_{Y(x)}(y) \times \\ 
        & \qquad c_{\YxIZb}\!\left(F_{Y(x)}(y) \mid F_{Z_{1}}(z_1),\dots, F_{Z_{D}}(z_{D}) \right),
    \end{aligned}
\end{equation}
where $F_{Y(x)}$ is the CDF associated with the marginal causal density $p_{Y(x)}$. 

Note that the copula density in (\ref{eq:cod}) is not only determined by the copula's family and its parameterization, but also by the choice of marginal CDFs for the covariates, $\bm{Z}$. If the conditional copula density is marginalized over the densities corresponding to the covariate CDFs, then the ranks of the marginal causal density will be uniformly distributed:
\begin{equation*}
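    \begin{aligned}
        p&\left(F_{Y(x)}(y)\right) = \int d\bm{z}~\prod_{d=1}^{D}p_{Z_{d}}(z_{d}) \times \\
        & \qquad c_{\YxIZb}\!\left(F_{Y(x)}(y) \,\middle|\, F_{Z_{1}}(z_1),\dots, F_{Z_{D}}(z_{D}) \right) = 1.
    \end{aligned}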
\end{equation*}
This uniformity is guaranteed if the marginal covariate densities $\{ p_{Z_d} \}_{d=1}^{D}$ correspond to the CDFs used to parameterize the copula. Thus, data simulated using our method matches the marginal causal quantity we specify. 
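
For a single covariate ($D=1$) with a continuous, strictly increasing CDF $F_{Z}$, the uniformity can be verified by a change of variables; we spell this out as an illustrative special case, with $u$, $v$ and $p'_{Z}$ introduced here only for the argument. Write $u = F_{Y(x)}(y)$ and substitute $v = F_{Z}(z)$, so that $dv = p_{Z}(z)\,dz$. For $D=1$ the conditional copula density coincides with the bivariate copula density, and hence
\begin{equation*}
    \int c_{\YxIZb}\!\left(u \cmid F_{Z}(z)\right) p_{Z}(z)~dz = \int_{0}^{1} c_{\YxIZb}(u \cmid v)~dv = 1,
\end{equation*}
because a copula has uniform margins. If $z$ is instead drawn from a different density $p'_{Z} \neq p_{Z}$, the substitution leaves the weight $p'_{Z}(z)/p_{Z}(z)$ inside the integral, and the result is no longer guaranteed to equal one.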
% Generally, if we instead consider a set of alternative marginal densities, $\{p'_{Z_d}\}_{d=1}^{D}$, which are not derived from the CDFs within the copula, i.e. $F_{Z_{d}}(Z_{d} = t) \neq F_{Z'_{d}}(Z'_{d} = t)$ then the rank uniformity is not assured.

% However, . Thus, the COD density is generally valid under any distribution of the past, and will not in guarantee the sampling from the specified marginal causal density if the covariate densities are derived from the CDFs that parameterize the copula. In the Supplementary Material, we present the conditions by which alternative distributions will yield samples drawn from the specified marginal causal density, assuming that the conditional copula density is Gaussian. Given how rarely these conditions are satisfied, we do not believe this will commonly be encountered in semi-synthetic benchmark generation. These conditions will likely be even harder to satisfy if a more complex multivariate copula (such as non-Gaussian vines) is chosen. We refer the reader to the Supplementary Material for further details.

For evaluating generalization, we set the CDFs within the copula density to be derived from the covariate densities in the test domain $P_{\bm{Z}XY}^{B}$. This allows us to construct the COD density across all covariate spaces,
\begin{equation*}
    \begin{aligned}
        p&_{\YxIZb}\left( y \cmid \bm{z}\right) = p^{B}_{Y(x)}\left(y\right) \times \\ 
        & \qquad c_{\YxIZb}\!\left(F^{B}_{Y(x)}(y) \,\middle|\, F_{Z_1^{B}}(z_1), \dots, F_{Z_{D}^{B}}(z_D) \right)
    \end{aligned}
\end{equation*}
which yields samples with a known marginal causal density equal to $p^{B}_{Y(x)}$, provided that the covariate CDFs in the copula are derived from the test-domain covariate densities. 

Two joint distributions with the same marginal covariate densities but different marginal causal densities must have different CODs. We can therefore assess differences between CODs by comparing the lower-dimensional marginal causal densities instead.

This offers a great deal of flexibility in testing method generalizability. One can draw training and test datasets with different covariate densities and propensity scores, while guaranteeing that the CODs remain consistent, and that the test data is drawn from a distribution with a marginal causal density parameterized by $p_{Y(x)}$. 


% When it comes to simulation based on actual data, we begin the process by first learning the marginal distributions of the pretreatment covariates in the test set, $\{\hat{F}^{B}_{Z}\}_{i}$. In this paper, we estimate these distribution functions using the empirical CDF, though more flexible techniques like Kernel Density Estimation could also be employed. Next, we learn a parametric form of marginal causal distribution $\hat{p}^{B}_{Y(x)}$ alongside a multivariate Gaussian copula $\hat{c}^{B}_{\ZbYx}\left(\hat{F}^{B}_{Y(x)}, \hat{F}^{B}_{Z_1}, \dots, \hat{F}^{B}_{Z_D} \right)$ which captures the marginal dependency between all the covariates and the marginal causal distribution. While we assume a Gaussian copula for its simplicity, more flexible models such as vine copulas could also be used for greater flexibility. Finally, we estimate the propensity score model $\hat{p}^{B}_{\XIZb}$. With all these components,
% \begin{equation*}
%     \{\hat{F}^{B}_{Z_{1}}, \dots, \hat{F}^{B}_{Z_{D}}, \hat{p}^{B}_{Y(x)}, \hat{p}^{B}_{\XIZb},  \hat{c}^{B}_{\ZbYx} \}
% \end{equation*}
% we can draw samples from the test distribution.

% We follow similar steps to sample training datasets, estimating $\{\hat{F}^{A}_{Z}\}_{i}$ and the Gaussian copula between the covariates, $\hat{c}^{A}_{\bm{Z}}$. We use the conditional univariate copula $\hat{c}^{A}_{\YxIZb}$ derived from the multivariate testing copula, $\hat{c}^{B}_{\ZbYx}$. Finally, we estimate the training propensity score model, $\hat{p}_{\XIZb}$. With the following components,
% \begin{equation*}
%     \{\hat{F}^{A}_{Z_{1}}, \dots, \hat{F}^{A}_{Z_{D}}, \hat{p}^{B}_{Y(x)}, \hat{c}^{B}_{\YxIZb}, \hat{p}^{A}_{\XIZb},  \hat{c}^{A}_{\bm{Z}}, \}
% \end{equation*}
% we can draw training samples with covariates and treatment assignments similarly distributed to the real dataset, but with the same COD as testing data. In this way, we allow for covariate and treatment distribution shift, while ensuring the marginal causal density in the test data is exactly equal to $\hat{p}^{B}_{Y(x)}$.

% \begin{algorithm}[t]
% \caption{Semi-synthetic Data Generation Process.}
% \begin{algorithmic}%[1]
% \vspace*{2pt}
% \STATE{\textbf{Step 1}: Learn marginal covariate distributions $\hat{F}^{B}_{Z}$.}
% \STATE{\textbf{Step 2}: Estimate $\hat{p}^{B}_{Y(x)}$.}
% \STATE{\textbf{Step 3}: Learn the multivariate Gaussian copula $\hat{c}^{B}_{\ZbYx}$.}
% \STATE{\textbf{Step 4}: Calculate the conditional univariate copula $\hat{c}^{B}_{\YxIZb}$.}
% \STATE{\textbf{Step 5}: Estimate $\hat{p}^{B}_{\XIZb}$.}
% \STATE{\textbf{Step 6}: Simulate data from the test distribution.}
% \STATE{\textbf{Step 7}: Learn marginal distributions $\hat{F}^{A}_{Z}$ and the Gaussian copula $\hat{c}^{A}_{\bm{Z}}$.}
% \STATE{\textbf{Step 8}: Sample ranks for covariates in domain $A$, $\Phi^{-1}\left(\hat{\bm{u}}_{\bm{Z}}^{A}\right) \sim \hat{c}^{A}_{\bm{Z}}$.}
% \STATE{\textbf{Step 9}: Calculate training covariate samples, $\hat{z}^{A}_{i} = \hat{F}_{Z_{i}}^{A}^{-1}(u_{Z_i})$}
% \STATE{\textbf{Step 10}: Estimate the propensity score model $\hat{p}^{A}_{\XIZb}$.}
% \STATE{\textbf{Step 11}: Draw treatment samples, $\hat{x}^{A} \sim \hat{p}^{A}_{\XIZb}(\cdot \mid )$.}
% \STATE{\textbf{Step 12}: Calculate the marginal causal rank in domain $A$, $u_{Y(x)}^{A} \sim \hat{c}^{B}_{\YxIZb}\left( \cdot \mid \hat{F}_{Z_1}^{B}(u_{Z_1}^{A}), \dots, \hat{F}_{Z_D}^{B}(u_{Z_D}^{A}) \right)$}
% \STATE{\textbf{Step 13}: Simulate data from the training distribution.}
% \end{algorithmic}
% \label{alg:semisynthetic_data}
% \end{algorithm}
\subsubsection{Generating Semi-Synthetic Benchmarks}
% \begin{algorithm}[t]
% \caption{Semi-synthetic Data Generation Process.}
% \begin{algorithmic}[1]
% \STATE{\textbf{Step 1}: Estimate empirical CDFs for test data, $\hat{F}^{B}_{Z_d},~ \forall ~ d = 1, \dots, D$.}
% \STATE{\textbf{Step 2}: Estimate marginal causal density and joint copula for test data, $\hat{p}^{B}_{Y(x)}, \hat{c}^{B}_{\ZbYx}$.}
% \STATE{\textbf{Step 3}: Estimate propensity score model for test data, $\hat{p}^{B}_{\XIZb}$.}
% \STATE{\textbf{Step 4}: Draw samples from the test copula, $\hat{\bm{u}}_{\bm{Z}}^{B} \sim \hat{c}^{B}_{\ZbYx}$.}
% \STATE{\textbf{Step 5}: Inverse transform to get covariate samples for test domain, $\hat{z}_{d}^{B} = \hat{F}_{Z_d}^{B^{-1}}(u_{Z_d}^{B}), \forall d = 1, \dots, D$.}
% \STATE{\textbf{Step 6}: Sample treatment variable for test domain using the propensity score model, $\hat{x}^{B} \sim \hat{p}^{B}_{\XIZb}(\cdot \mid \bm{Z}^{B})$.}
% \STATE{\textbf{Step 7}: Inverse transform to get outcome samples for test data, $\hat{y}^{B} = \hat{F}_{Y(x)}^{B^{-1}}(u_{Y(x)}^{B})$, where $u_{Y(x)}^{B} \sim \hat{c}^{B}_{\YxIZb}\left( \cdot \mid \hat{F}_{Z_1}^{B}(u_{Z_1}), \dots, \hat{F}_{Z_D}^{B}(u_{Z_D}) \right)$.}
% \STATE{\textbf{Step 8}: Estimate empirical CDFs and copula for training data, $\hat{F}^{A}_{Z_d}, \hat{c}^{A}_{\bm{Z}}, \forall d = 1, \dots, D$.}
% \STATE{\textbf{Step 9}: Sample covariate distributions for training data, $\hat{\bm{u}}_{\bm{Z}}^{A} \sim \hat{c}^{A}_{\bm{Z}}$.}
% \STATE{\textbf{Step 10}: Inverse transform to get covariate samples for training data, $\hat{z}_{d}^{A} = \hat{F}_{Z_d}^{A^{-1}}(u_{Z_d}^{A}), \forall d = 1, \dots, D$.}
% \STATE{\textbf{Step 11}: Estimate propensity score model for training data, $\hat{p}^{A}_{\XIZb}$.}
% \STATE{\textbf{Step 12}: Sample treatment variable for training data, $\hat{x}^{A} \sim \hat{p}^{A}_{\XIZb}(\cdot \mid \bm{Z}^{A})$.}
% \STATE{\textbf{Step 13}: Calculate marginal causal rank for training data, $u_{Y(x)}^{A} \sim \hat{c}^{B}_{\YxIZb}\left( \cdot \mid \hat{F}_{Z_1}^{B}(u_{Z_1}^{A}), \dots, \hat{F}_{Z_D}^{B}(u_{Z_D}^{A}) \right)$.}
% \STATE{\textbf{Step 14}: Inverse transform to get outcome samples from training data, $\hat{y}^{A} = \hat{F}_{Y(x)}^{B^{-1}}\left(u_{Y(x)}^{A} \right)$.}
% \end{algorithmic}
% \label{alg:semisynthetic_data}
% \end{algorithm}

% \begin{algorithm}[h!]
% \caption{Semi-synthetic Data Generation Process.}
% \begin{algorithmic}
% % \scriptsize
% \vspace*{2pt}
% \STATE{\textbf{Input}:~Original test data; original covariates and treatment from training data.}
% \vspace*{2pt}
% \STATE{\textbf{Parameter estimations on test domain}} 

% \hspace*{8pt} Estimate test empirical CDFs, $\{\hat{F}^{B}_{Z_d}\}_{d=1}^D$; marginal causal density and joint copula $\hat{p}^{B}_{Y(x)}, \hat{c}^{B}_{\ZbYx}$; propensity score model $\hat{p}^{B}_{\XIZb}$.
% \vspace*{2pt}
% \STATE{\textbf{Transformation on test domain}}

% \hspace*{8pt} Sample $(\bm{u}_{\bm{Z}}^{B}, u_{Y(x)}) \sim \hat{c}^{B}_{\ZbYx}$;\\
% \hspace*{8pt} Calculate$\{z_{d}^{B} = \hat{F}_{Z_d}^{B^{-1}}(u_{Z_d}^{B})\}_{d=1}^D$;\\
% \hspace*{8pt} Sample $x^{B} \sim \hat{p}^{B}_{\XIZb}\left(\cdot \mid \bm{Z}^{B}\right)$;  $y^{B} = \hat{F}_{Y(x)}^{B^{-1}}(u_{Y(x)}^{B})$.

% \vspace*{2pt}
% \STATE{\textbf{Parameter estimations on training domain}}

% \hspace*{8pt} Estimate training empirical CDFs, copula and propensity score model $\{\hat{F}^{A}_{Z_d}\}_{d=1}^D,~\hat{c}^{A}_{\bm{Z}}$,  $\hat{p}^{A}_{\XIZb}$.

% \vspace*{2pt}
% \STATE{\textbf{Transformation on training domain}} 

% \hspace*{8pt} Sample $u_{\bm{Z}}^{A} \sim \hat{c}^{A}_{\bm{Z}}$;\\
% \hspace*{8pt} Calculate $\{z_{d}^{A} = \hat{F}_{Z_d}^{A^{-1}}(u_{Z_d}^{A})\}_{d=1}^D$;\\
% \hspace*{8pt} Sample $x^{A} \sim \hat{p}^{A}_{\XIZb}(\cdot \mid \bm{z}^{A})$;
% \hspace*{8pt} Sample $u_{Y(x)}^{A} \sim \hat{c}^{B}_{\YxIZb}( \cdot \mid \hat{F}_{Z_1}^{B}(z_{1}^{A}), \dots, \hat{F}_{Z_D}^{B}(z_{D}^{A}))$;\\
% \hspace*{8pt} Sample  $y^{A} = \hat{F}_{Y(x)}^{B^{-1}}\left(u_{Y(x)}^{A} \right)$.

% \vspace*{2pt}
% \STATE{\textbf{Output}: $D^{A} = \{(z^{A}, x^{A}, y^{A})\}_{i}$}, $D^{B} = \{(\bm{z}^{B}, x^{B}, y^{B})\}_{i}$ 
% \end{algorithmic}
% \label{alg:semisynthetic_data}
% \end{algorithm}
% \begin{algorithm}[h!]
% \caption{Semi-synthetic Data Generation.}
% \begin{algorithmic}
% % \scriptsize
% \vspace*{2pt}
% \STATE{\textbf{Input}:~Original test data; original covariates and treatment from training data.}
% \vspace*{2pt}
% \STATE{\textbf{Parameter estimations on test domain}} 

% Estimate test empirical CDFs, $\{\hat{F}^{B}_{Z_d}\}_{d=1}^D$; marginal causal density, $\hat{p}^{B}_{Y(x)}$; joint copula , $\hat{c}^{B}_{\ZbYx}$; propensity score model $\hat{p}^{B}_{\XIZb}$.
% \vspace*{2pt}
% \STATE{\textbf{Transformation on test domain}}

% Sample $(\bm{u}_{\bm{Z}}^{B}, u_{Y(x)}) \sim \hat{c}^{B}_{\ZbYx}$;\\
% Calculate $\{z_{d}^{B} = \hat{F}_{Z_d}^{B^{-1}}(u_{Z_d}^{B})\}_{d=1}^D$;\\
% Sample $x^{B} \sim \hat{p}^{B}_{\XIZb}\left(\cdot \mid \bm{Z}^{B}\right)$;\\
% Calculate $y^{B} = \hat{F}_{Y(x)}^{B^{-1}}(u_{Y(x)}^{B})$.

% \vspace*{2pt}
% \STATE{\textbf{Parameter estimation on training domain}}

% Estimate training empirical CDFs, $\{\hat{F}^{A}_{Z_d}\}_{d=1}^D$; covariate copula, $\hat{c}^{A}_{\bm{Z}}$;  propensity score model,  $\hat{p}^{A}_{\XIZb}$.

% \vspace*{2pt}
% \STATE{\textbf{Transformation on training domain}} 

% Sample $u_{\bm{Z}}^{A} \sim \hat{c}^{A}_{\bm{Z}}$;\\
% Calculate $\{z_{d}^{A} = \hat{F}_{Z_d}^{A^{-1}}(u_{Z_d}^{A})\}_{d=1}^D$;\\
% Sample $x^{A} \sim \hat{p}^{A}_{\XIZb}(\cdot \mid \bm{z}^{A})$; \\
% Sample $u_{Y(x)}^{A} \sim \hat{c}^{B}_{\YxIZb}( \cdot \mid \hat{F}_{Z_1}^{B}(z_{1}^{A}), \dots, \hat{F}_{Z_D}^{B}(z_{D}^{A}))$;\\
% Calculate  $y^{A} = \hat{F}_{Y(x)}^{B^{-1}}\left(u_{Y(x)}^{A} \right)$.

% \vspace*{2pt}
% \STATE{\textbf{Output}: Training sample $D^{A} = (\bm{z}^{A}, x^{A}, y^{A})$};\\ \hspace*{40pt} Test sample $D^{B} = (\bm{z}^{B}, x^{B}, y^{B})$. 
% \end{algorithmic}
% \label{alg:semisynthetic_data}
% \end{algorithm}
\begin{algorithm}[h!]
\caption{Semi-synthetic Data Generation.}
\begin{algorithmic}
% \scriptsize
\vspace*{2pt}
\STATE{\textbf{Input}:~Original test data; original covariates and treatment from training data.}
\vspace*{2pt}
\STATE{\textbf{Parameter estimation on test domain $B$}} 

Estimate the joint covariate-treatment density, $\hat{p}^{B}_{\bm{Z}X}$; marginal causal density, $\hat{p}^{B}_{Y(x)}$; conditional copula, $\hat{c}^{B}_{\YxIZb}$.
% \vspace*{2pt}
\STATE{\textbf{Data simulation on domain $B$}}

Sample $(\bm{z}^{B}, x^{B}) \sim \hat{p}^{B}_{\bm{Z}X}$;\\
Sample the causal effect rank $\hat{u}^{B}_{Y(x)|\bm{Z}} \sim U[0,1]$;\\
Calculate $y^{B} = \hat{F}_{Y(x)}^{B^{^{-1}}}\left(\hat{c}^{B}_{Y(x)|\bm{Z}}(\hat{u}^{B}_{Y(x)|\bm{Z}} \mid \bm{z}^{B})\right)$.

\vspace*{2pt}
\STATE{\textbf{Parameter estimation on training domain $A$}}

Estimate the joint covariate-treatment density, $\hat{p}^{A}_{\bm{Z}X}$.

\vspace*{2pt}
\STATE{\textbf{Data simulation on domain $A$}} 

Sample $(\bm{z}^{A}, x^{A}) \sim \hat{p}^{A}_{\bm{Z}X}$;\\
Sample the causal effect rank $\hat{u}_{Y(x)|\bm{Z}}^{A} \sim U[0,1]$;\\
Calculate $y^{A} = \hat{F}_{Y(x)}^{B^{^{-1}}}\left(\hat{c}^{B}_{Y(x)|\bm{Z}}(\hat{u}_{Y(x)|\bm{Z}}^{A} \mid \bm{z}^{A})\right)$.

\vspace*{2pt}
\STATE{\textbf{Output}: Training sample $D^{A} = (\bm{z}^{A}, x^{A}, y^{A})$};\\ \hspace*{40pt} Test sample $D^{B} = (\bm{z}^{B}, x^{B}, y^{B})$. 
\end{algorithmic}
\label{alg:semisynthetic_data}
\end{algorithm}

Our primary workflow is summarized in \Cref{alg:semisynthetic_data}. 
First, we estimate the joint covariate-treatment density of the test data, $\hat{p}^{B}_{\bm{Z}X}$. We then estimate the marginal causal density $\hat{p}^{B}_{Y(x)}$ and the conditional copula $\hat{c}^{B}_{\YxIZb}$, which captures the covariate-outcome dependency conditional on treatment. Given covariate and treatment samples, we draw a uniform rank and pass it through the conditional copula to obtain the marginal causal rank $\hat{u}_{Y(x)}$; the outcome then follows from the inverse transform $\hat{F}_{Y(x)}^{B^{^{-1}}}$. For the training data, we follow the same steps with the training-domain density $\hat{p}^{A}_{\bm{Z}X}$, while reusing the test-domain marginal causal density and conditional copula so that the COD is shared across domains.
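
To make the outcome-simulation step concrete, the following sketch assumes a single covariate ($D=1$) and a Gaussian conditional copula with latent correlation $\rho$; both are assumptions made for illustration, and $v$ and $\rho$ are notation introduced here. Given a covariate value $z$ and a uniform rank $v \sim U[0,1]$, the marginal causal rank and the outcome are
\begin{equation*}
    \begin{aligned}
        \hat{u}_{Y(x)} &= \Phi\!\left( \rho\, \Phi^{-1}\!\left(\hat{F}^{B}_{Z}(z)\right) + \sqrt{1-\rho^{2}}\; \Phi^{-1}(v) \right), \\
        y &= \hat{F}_{Y(x)}^{B^{^{-1}}}\!\left( \hat{u}_{Y(x)} \right),
    \end{aligned}
\end{equation*}
where $\Phi$ is the standard normal CDF. When $\rho = 0$ the rank reduces to $v$, so the outcome is independent of the covariate; as $|\rho| \to 1$ the rank is determined almost entirely by $\hat{F}^{B}_{Z}(z)$.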
% First, we estimate the empirical CDFs of the pretreatment covariates of the test data, denoted as $\hat{F}^{B}_{Z_d},~\forall ~d = \{1,\dots, D\}$. We then estimate the marginal causal density $\hat{p}^{B}_{Y(x)}$ and the joint copula $\hat{c}^{B}_{\ZbYx}$, capturing the covariate-outcome dependency conditional on treatment. With the test copula known, we draw samples $\bm{u}_{\bm{Z}}^{B} \sim \hat{c}^{B}_{\ZbYx}$, and use inverse transforms to generate the covariate samples $z_{d}^{B} = \hat{F}_{Z_d}^{B^{-1}}(u_{Z_d}^{B})$. Next, we estimate the propensity score model for the test data, $\hat{p}^{B}_{\XIZb}$ and sample the treatment variable $x^{B} \sim \hat{p}^{B}_{\XIZb}(\cdot \mid \bm{z}^{B})$. The outcome data is calculated using $y^{B} = \hat{F}_{Y(x)}^{B^{^{-1}}}(u_{Y(x)}^{B})$, where $u_{Y(x)}^{B}$ is the sampled outcome rank from the copula. For the training data, we follow a similar approach. A general summary of how to simulate from this workflow can be found in \Cref{alg:semisynthetic_data}. %\Cref{alg:semisynthetic_data}. %With this approach we get the semi-synthetic samples from test domain and training domain.

% First, we estimate the empirical CDFs $\hat{F}^{A}_{Z_d},~\forall ~ d = \{1, \dots, D\}$ and the covariate copula $\hat{c}^{A}_{\bm{Z}}$. We draw samples from this copula, $\hat{\bm{u}}_{\bm{Z}}^{A}$, and perform an inverse transform to generate the actual covariate samples, $\hat{z}_{d}^{A} = \hat{F}_{Z_d}^{A^{-1}}(\hat{u}_{Z_d}^{A})$.

% We then estimate the propensity score model for the training data, $\hat{p}^{A}_{\XIZb}$, and use it to sample the treatment variable, $\hat{x}^{A} \sim \hat{p}^{A}_{\XIZb}(\cdot \mid \hat{\bm{z}}^{A})$. The marginal causal rank for the training data is calculated as $\hat{u}_{Y(x)}^{A}$, using the copula $\hat{c}^{B}_{\YxIZb}$ from the test data:
% \begin{align*}
%     \hat{u}_{Y(x)}^{A} &\sim \hat{c}^{B}_{\YxIZb}\big( \cdot \mid \\
%     &\hat{F}_{Z_1}^{B}(\hat{F}_{Z_1}^{A^{-1}}(u_{Z_1}^{A})), \dots, \hat{F}_{Z_D}^{B}(\hat{F}_{Z_D}^{A^{-1}}(u_{Z_D}^{A})) \big).
% \end{align*}
% Finally, we perform an inverse transform to obtain the outcome samples for the training data, $\hat{y}^{A} = \hat{F}_{Y(x)}^{B^{-1}}(\hat{u}_{Y(x)}^{A})$, where we make sure to use the marginal causal distribution parameters and the conditional copula $\hat{c}^{B}_{\YxIZb}$ are derived from the test data to ensure the test and training CODs are identical.

% 3) Sample from copula

% 4) Draw samples from the copula, and inverse CDF transform to draw samples from Z.

% 5) Estimate propensity score

% 6) Sample treatment

% 7) Sample outcome using $y^{(i)} = \hat{F}_{Y(x)}^{B}(u^{(i)}_{Y(x)} \mid x^{(i)})$

% 8) Do the same for the training data, except make sure that when sampling the quantiles from the sample outcome, $u_{Y(x)}^{A} \sim \hat{c}^{B}_{\YxIZb}\left( \cdot \mid u_{Z_1}^{A}, \dots, u_{Z_D}^{A} \right)$

% 9) Sample the training outcome $y^{A} = \hat{F}^{B}_{Y(x)}\left( u_{Y(x)}^{A} \mid x^{(i)}_{A} \right)$


\subsection{Statistical Testing}

Hypothesis tests on high-dimensional objects have very little power against a wide range of alternatives. Testing lower-dimensional objects instead can substantially increase the chance of rejection when the null hypothesis fails to hold. Because the frugal parameterization gives us the marginal causal density $p^B_{Y(x)}$, we can construct statistical tests on 
$\mu^B(x)$ rather than $\mu^B(\bm{z}, x)$ for mean regression models, and on $P^B_{Y(x)}$ rather than $P^B_{\YxIZb}$ for distributional regression.
%$\mathcal{H}_0: \mathbb{E}\hat{\mu}(x) = \mu(x)$ instead of $\mathcal{H}_0: \mathbb{E}\hat{\mu}(x,\bm{z}) =  \mu(x,\bm{z})$ for mean regression models, and $\mathcal{H}_0: \hat{P}_{Y(x)} = P_{Y(x)}$ instead of $\mathcal{H}_0: \hat{P}_{\YxIZb} = P_{\YxIZb}$ for distributional regression.


Our testing algorithms require the following parameters: $N_{btp}$, the number of bootstrap samples, and $N^{A}$ and $N^{B}$, the numbers of samples simulated from the training and test domains, respectively, in each bootstrap. We provide the testing algorithm for mean regression models in \Cref{alg:mean_test_algo}; it extends to distributional regression models as follows. After applying $\hat{f}$ to $D^{B}_b$, for each $i$ we sample $\{y^j_{ib}\}_{j=1}^{N_Y}$ from the predicted distribution $\hat{P}_{Y(x_{ib})|z_{ib}}$, where $N_Y$ is the number of outcome samples drawn per prediction, and estimate marginal causal distributions such as $\hat{P}^{B}_{Y\left(x^0\right)} :=\bigcup_{b=1}^{N_{btp}} \bigcup_{i=1}^{N^{B}} \bigcup_{j=1}^{N_Y} \left\{ y_{ib}^j \mid x_{ib} = x^0 \right\}$. We then conduct a distribution test, e.g.~the Kolmogorov-Smirnov test, of $\mathcal{H}_0:  \hat{P}^{B}_{Y(x^0)}=P^{B}_{Y(x^0)}$ to obtain the p-value $p$.
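
As a sketch of the distributional test, assume the one-sample Kolmogorov-Smirnov form, in which the pooled samples are compared against the known reference CDF $F^{B}_{Y(x^0)}$; the symbols $n$, $y_k$ and $\hat{F}_n$ are introduced here for illustration only. With $y_1, \dots, y_n$ denoting the pooled samples in $\hat{P}^{B}_{Y(x^0)}$, the test statistic is
\begin{equation*}
    D_n = \sup_{y} \left| \hat{F}_n(y) - F^{B}_{Y(x^0)}(y) \right|, \qquad \hat{F}_n(y) = \frac{1}{n}\sum_{k=1}^{n} \mathbb{1}(y_k \le y),
\end{equation*}
and large values of $D_n$ correspond to small p-values.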

Our testing algorithm is flexible in the choice of testing reference, e.g.~in \Cref{alg:mean_test_algo}, we can replace $\mu^{B}(x)$ with $\tau^{B}$ as the reference target when $X$ is binary, which is what we used in our experiments. The testing method used for distributional regression models can also be replaced by other statistical tests, such as the Maximum Mean Discrepancy Test \citep{gretton2012kernel} or the Cramér-von Mises Test \citep{anderson1962distribution}.


%for distributional testing, we also need to specify $N_Y$, which is the number of outcome samples simulated from distributional regression output for each $\hat{f}(x,\bm{z})$. 

% We provide testing methods for two types of regression models: mean regression in  \Cref{alg:mean_test_algo} or distributional regression in  \Cref{alg:dist_test_algo}. Note that, in \Cref{alg:mean_test_algo}, we can replace $\mu^{B}(x)$ with $\tau^{B}$ as the reference target when $X$ is binary, which is what we used in our experiments. 



\begin{algorithm}[t]
\caption{Generalizability Evaluation on Mean Regression Models.}
\begin{algorithmic}%[1] % this prints line nubmers
% \scriptsize
\vspace*{2pt}
\STATE{\textbf{Input}:~~$\Theta^{A}$: parameters for training domain,\\
% \hspace*{34.3pt} $B$: number of bootstrap samples,\\
% % \hspace*{34.3pt} $\delta$: confidence level,\\
% \hspace*{34.3pt} $N^{B}$: number of $(X,Z)$ samples simulated for each bootstrap,\\
\hspace*{34.3pt} $\Theta^{B}$: parameters for test domain,\\
\hspace*{34.3pt} $\mu^{B}(x^0)$: reference.}
\vspace*{3pt}
\FOR{$b=1, \ldots, {N_{btp}}$}
    \STATE{Draw $D_b^{A}:= \{(\bm{z}'_{ib}, x'_{ib}, y'_{ib})\}_{i=1}^{N^{A}} \sim P_{\Theta^{A}}$};
    \STATE{Fit the regression model, $\hat{f}$, on $D_b^{A}$};
    \STATE{Draw $D_b^{B}:= \{(\bm{z}_{ib}, x_{ib})\}_{i=1}^{N^{B}} \sim P_{\Theta^{B}}$};
    \STATE{Apply $\hat{f}$ on $D_b^{B}$ to get predictions $\{\hat{f}(\bm{z}_{ib}, x_{ib})\}_{i = 1}^{N^{B}}$}; 
    \STATE{Calculate 
    $$\hat{\mu}_b^{B}(x^0) = \frac{\sum_{i=1}^{N^{B}} \mathbb{1}(x_{ib}=x^0)\hat{f}(\bm{z}_{ib}, x_{ib})}{\sum_{i=1}^{N^{B}}\mathbb{1}(x_{ib}=x^0)}.$$}
    % $$\hat{\mu}_b^{B}(x^0) = \frac{1}{\sum_{i=1}^{N^{B}}\mathbb{1}(x_{ib}=x^0)}\sum_{i=1}^{N^{B}} \mathbb{1}(x_{ib}=x^0)\hat{f}(x_{ib},\bm{z}_{ib})$$}.
\ENDFOR
\STATE{\textbf{end for}}
\vspace*{3pt}
% \STATE{Denote $l^{B}$, $u^{B}$ as the $(1-\delta)/2$ and $1-(1-\delta)/2$ quantiles of $\{\hat{\mu}_b^{B}(c)\}_{b=1}^B$}.
% \IF{$\mu^{B}\in \left[l^{B}, u^{B}\right]$}
\STATE{Get the p-value $p$ by conducting a t-test to compare the target parameter $\mu^{B}(x^0)$ and the distribution of $\{\hat{\mu}_b^{B}(x^0)\}_{b=1}^{N_{btp}}$}.
% \IF{$\mu^{B}\in \left[l^{B}, u^{B}\right]$}
% \STATE{\textbf{Return} True.}
% \ELSE
% \STATE{\textbf{Return} False.}
% \ENDIF
\STATE{\textbf{Return} $p$.}
\vspace*{3pt}
\end{algorithmic}
\label{alg:mean_test_algo}
\end{algorithm}
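
The final step of \Cref{alg:mean_test_algo} can be written out explicitly if we assume a two-sided one-sample t-test; this is one plausible reading of the algorithm rather than a confirmed detail, and $\bar{\mu}$, $s$ and $t$ are notation introduced here. Under this reading,
\begin{equation*}
    \begin{aligned}
        \bar{\mu} &= \frac{1}{N_{btp}} \sum_{b=1}^{N_{btp}} \hat{\mu}_b^{B}(x^0), \qquad
        s^{2} = \frac{1}{N_{btp}-1} \sum_{b=1}^{N_{btp}} \left( \hat{\mu}_b^{B}(x^0) - \bar{\mu} \right)^{2}, \\
        t &= \frac{\bar{\mu} - \mu^{B}(x^0)}{s / \sqrt{N_{btp}}},
    \end{aligned}
\end{equation*}
and $p$ is the two-sided tail probability of $|t|$ under a Student t distribution with $N_{btp}-1$ degrees of freedom.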

% \begin{algorithm}[t]
% \caption{Generalizability Evaluation on Distributional Regression Models.}
% \begin{algorithmic}%[1] % this prints line numbers
% \vspace*{2pt}
% \STATE{\textbf{Input}:~~$\hat{f}$: fitted distributional regression model,\\
% \hspace*{34.3pt} $B$: number of bootstrap samples,\\
% % \hspace*{34.3pt} $\alpha$: significance level,\\
% \hspace*{34.3pt} $N^{B}$: number of $(X,Z)$ samples generated in each bootstrap,\\
% \hspace*{34.3pt} $N_Y$: number of $Y$ samples simulated from distributional regression output $\hat{f}(X,Z)$,\\
% \hspace*{34.3pt} $\Theta^{B}$: parameters for test domain,\\
% \hspace*{34.3pt} $\mathbb{P}(Y|do(X=c))$: reference.}
% \vspace*{3pt}
% \FOR{$b=1, \ldots, B$}
%     \STATE{Draw sample data $D_b^{B}:= \{(X_{ib},Z_{ib})\}_{i=1}^{N^{B}} \sim P_{\Theta^{B}}$};
%     \STATE{Apply $\hat{f}$ on $D_b^{B}$ to get distributional predictions $\hat{\mathbb{P}}\left(Y|X_{ib}, Z_{ib}\right)$};
%     \STATE{For each $i$, sample $\{Y^j_{ib}\}_{j=1}^{N_Y}$ from $\hat{\mathbb{P}}(Y|X_{ib}, Z_{ib})$}.
% \ENDFOR
% \STATE{\textbf{end for}}
% \vspace*{3pt}
% \STATE{Estimate $ \smash{\hat{P}(Y \mid do(X) = c) = \bigcup_{b=1}^{B} \bigcup_{i=1}^{N^{B}} \bigcup_{j=1}^{N_Y} \left\{ Y_{ib}^j \mid X_{ib} = c \right\}}$.}
% \STATE{Conduct distribution tests, e.g., the Kolmogorov-Smirnov test, to evaluate $\mathcal{H}_0:\hat{P}(Y \mid do(X) = c) =P(Y \mid do(X) = c)$ and get p-value $p$.}
% % \IF{$p>\alpha$}
% % \STATE{\textbf{Return} True.}
% % \ELSE
% % \STATE{\textbf{Return} False.}
% % \ENDIF
% \STATE{\textbf{Return} $p$.}
% \vspace*{3pt}
% \end{algorithmic}
% \label{alg:dist_test_algo}
% \end{algorithm}
% \begin{algorithm}[t]
% \caption{Generalizability Evaluation on Distributional Regression Models.}
% \begin{algorithmic}%[1] % this prints line numbers
% % \scriptsize
% \vspace*{2pt}
% \STATE{\textbf{Input}:~~$\Theta^{A}$: parameters for training domain,\\
% % \hspace*{34.3pt} $B$: number of bootstrap samples,\\
% % % \hspace*{34.3pt} $\alpha$: significance level,\\
% % \hspace*{34.3pt} $N^{B}$: number of $(X,Z)$ samples simulated in each bootstrap,\\
% % \hspace*{34.3pt} $N_Y$: number of $Y$ samples simulated from distributional regression output $\hat{f}(X,Z)$,\\
% \hspace*{34.3pt} $\Theta^{B}$: parameters for test domain,\\
% \hspace*{34.3pt} $P^{B}_{Y(x^0)}$: reference.} \\
% % \hspace*{34.3pt} $P_{Y(x)}^{B}(\cdot \mid x)$: reference.}
% \vspace*{3pt}
% \FOR{$b=1, \ldots, N_{btp}$}
%     \STATE{Sample $D_b^{A}:= \{(\bm{z}'_{ib}, x'_{ib},y'_{ib})\}_{i=1}^{N^{A}} \sim P_{\Theta^{A}}$}; 
%     \STATE{Fit the distributional regression model, $\hat{f}$, on $D_b^{A}$};
%     \STATE{Sample $D_b^{B}:= \{\bm{z}_{ib}, x_{ib})\}_{i=1}^{N^{B}} \sim P_{\Theta^{B}}$}; 
%     \STATE{Apply $\hat{f}$ on $D_b^{B}$ to get distributional predictions $\hat{P}_{Y(x_{ib})|z_{ib}}$};
%     \STATE{For each $i$, sample $\{y^j_{ib}\}_{j=1}^{N_Y}$ from $\hat{P}_{Y(x_{ib})|z_{ib}}$}.
% \ENDFOR
% \STATE{\textbf{end for}}
% \vspace*{3pt}
% \STATE{Estimate $\hat{P}^{B}_{Y\left(x^0\right)} =$}
% % \STATE{\hspace{1em} $\hat{P}(Y \mid do(X) = c) =$}
% \STATE{\hspace{2em} $\bigcup_{b=1}^{N_{btp}} \bigcup_{i=1}^{N^{B}} \bigcup_{j=1}^{N_Y} \left\{ y_{ib}^j \mid x_{ib} = x^0 \right\}$}.
% \STATE{Conduct distribution tests, e.g.~the Kolmogorov-Smirnov test, for $\mathcal{H}_0:  \hat{P}^{B}_{Y(x^0)}=P^{B}_{Y(x^0)}$ and get the p-value $p$.}
% \STATE{\textbf{Return} $p$.}
% \vspace*{3pt}
% \end{algorithmic}
% \label{alg:dist_test_algo}
% \end{algorithm}



A summary of this workflow is presented in \Cref{fig:algo-workflow}.


\section{EXPERIMENTS}
\label{sec:experiment}

In this section, we use our workflow to evaluate the generalizability of a range of modern causal models.


As discussed in review papers including \cite{curth2021really}, \cite{ling2022critical} and \cite{kiriakidou2022evaluation}, methods such as Meta-Learners (e.g.~T- and S-learners) \citep{kunzel2019metalearners}, CausalForest \citep{wager2018estimation}, TARNet \citep{shalit2017estimating}, and BART \citep{chipman2010bart} are widely used for CATE estimation, each offering advantages in different scenarios. Our evaluation focuses on their performance under covariate distribution shifts, specifically examining the accuracy of their CATE estimates. Further details can be found in the Supplementary Material. 


Another algorithm of interest is engression, introduced in \cite{shen2023engression}. It approximates the conditional distribution using a pre-additive noise model. As a distributional regression method, it is capable of extrapolating to unseen or underrepresented data points through its learned non-linear transformations. The key factors that affect engression's generalizability are the distance between the two domains and whether the true underlying function is strictly monotonic in the extrapolation region. In our experiments, we evaluate engression in both the S-learner and T-learner settings.

\subsection{Synthetic Data}

We first conduct experiments on synthetic data to demonstrate and validate our method. While our approach can handle various data types and is particularly effective with high-dimensional covariates and continuous treatment interventions, for clarity this simple example focuses on two continuous confounders, $Z_1$ and $Z_2$, sampled from identical Gamma distributions, with a binary intervention $X$. We initially assume that both datasets come from randomized controlled trials (RCTs), so that $X \sim \operatorname{Bernoulli}(0.5)$ under $P^A$ and $P^B$. We parameterize the Gaussian copula, $c_{\ZbYx}$, with Spearman correlation coefficients $\rho_{Z_1 Z_2} = 0$, $\rho_{Z_1Y(x)} = 0.1$ and $\rho_{Z_2Y(x)} = 0.9$. The distribution of $Y(x)$ is defined as $\mathcal{N}(1+2x,1)$ in the test domain. For the simulation, we generate $N^{A} = 200$ training samples and 
$N^{B} = 50$ test samples per bootstrap, with $N_{btp}=200$ bootstraps in total, and repeat this process for 50 iterations. The marginal distributions of $Z_1$ and $Z_2$ in the training domain are identical Gamma distributions with shape $k=1$ and rate $\theta=1$.
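
As a worked illustration of this copula specification, suppose the Spearman coefficients are converted to Pearson correlations of the latent Gaussian variables; this conversion is an assumption about implementation that the text does not state, and $\rho^{S}$, $\rho^{P}$ are notation introduced here. For a Gaussian copula the standard relation is
\begin{equation*}
    \rho^{P} = 2 \sin\!\left( \frac{\pi \rho^{S}}{6} \right),
\end{equation*}
which gives $\rho^{P}_{Z_1 Z_2} = 0$, $\rho^{P}_{Z_1 Y(x)} = 2\sin(\pi/60) \approx 0.105$ and $\rho^{P}_{Z_2 Y(x)} = 2\sin(0.15\pi) \approx 0.908$. The determinant of the resulting $3 \times 3$ correlation matrix is $1 - 0.105^{2} - 0.908^{2} \approx 0.165 > 0$, so the matrix is positive definite and the copula is well defined.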

We examine two settings. In Setting 1, the test domain has a slight covariate shift, with $Z_1$ and $Z_2$ following a Gamma distribution with $k=2$, $\theta=1$. In Setting 2, the shift is larger ($k=4$, $\theta=1$). Despite these shifts, the COD remains the same due to the frugal parameterization, as shown in \Cref{fig:synthetic}.


\begin{figure}[h]
\vspace{.3in}
\centerline{\includegraphics[width=1\linewidth]{synthetic_group.png}}
\vspace{.3in}
\caption{Synthetic Data Generated from Setting 1 (Top) and Setting 2 (Bottom). }
\label{fig:synthetic}
\end{figure}

The p-values in \Cref{fig:synthetic_mean_p} illustrate the differences across models. As expected, with the larger domain shift in Setting 2, models face greater difficulty in generalizing, as reflected by p-values that are generally smaller than in Setting 1. T-BART and T-engression generalize well in this specific setting, with p-values that are uniformly distributed. TARNet struggles, likely due to the complexity of its representation learning network design and hyperparameter tuning.

\begin{figure}[h]
\vspace{.3in}
\centerline{\includegraphics[width=0.8\linewidth]{synthetic_mean_p.png}}
\vspace{.3in}
\caption{$p$-values of Mean Regression Testing, Synthetic Data of 50 Iterations.}
\label{fig:synthetic_mean_p}
\end{figure}

Our method also allows us to test the generalizability of distributional regression. \Cref{fig:synthetic_distribution_p} shows the p-values of distributional regression testing for S-engression under the two settings, with $N_Y=50$. Unsurprisingly, since the covariate distribution shift in Setting 1 is smaller, S-engression generalizes better there than in Setting 2.

\begin{figure}[h]
\vspace{.3in}
\centerline{\includegraphics[width=0.8\linewidth]{synthetic_distribution_p_new.png}}
\vspace{.3in}
\caption{$p$-values of Distributional Regression Testing (Kolmogorov–Smirnov Test) of S-engression, Synthetic Data of 50 Iterations.}
\label{fig:synthetic_distribution_p}
\end{figure}

Because it supports flexible simulation based on actual data, our method is useful for stress testing and model diagnostics. \Cref{fig:varying_n} illustrates an example in which we examine how varying the training set size affects the generalizability of T-BART and T-engression. The generalizability of both models worsens as $N^{A}$ exceeds 100. This may stem from problems such as overfitting, but solving these problems is not our focus. Rather, our method serves as a tool to detect and highlight potential issues when making predictions on real data, which is made feasible by simulating from actual data using the frugal parameterization.

\begin{figure}[t]
\vspace{.3in}
\centerline{\includegraphics[width=0.8\linewidth]{varying_n_train.png}}
\vspace{.3in}
\caption{$p$-values of Mean Regression Testing of 50 Iterations, Varying $N^{A}$, Setting 2, Synthetic Data.}
\label{fig:varying_n}
\end{figure}

Note that extrapolation performance for models like engression is typically evaluated visually, one dimension at a time. Our method, in contrast, provides a statistical evaluation of extrapolation performance with high-dimensional covariates.

Only Gaussian copulas with a fully connected dependency structure were used in the experiments of this section. However, the framework can be generalized to pair copula constructions, which allow for non-Gaussian copulas with more complex dependency structures and for higher-dimensional covariates. Further details for each of these cases can be found in the Additional Experiments section of the Supplementary Material, which investigates how the algorithms' generalizability varies with different dependency structures.

\subsection{Real Data}
\label{sec:IHDP}
We evaluate algorithm generalizability using the Infant Health and Development Program (IHDP) dataset, a randomized experiment conducted between 1985 and 1988 to study the effect of home visits on infants' cognitive test scores~\citep{hill2011bayesian}. This dataset has become widely used in domain adaptation research \citep{curth2021really,shi2021invariant}. In this section, we adapt the experiments presented by \citet{johansson2018learning}, who train a range of causal ML algorithms on IHDP data and measure in-domain predictive performance using MSE. We extend these experiments by showing how our validation framework can be used to test out-of-domain predictive performance. Specifically, we compare the MSE metric against the p-values obtained via our proposed testing framework, highlighting how our method provides a more informative measure of whether a model can generalize robustly across different domains.

The IHDP dataset contains $T=1000$ trials, each consisting of the same 747 subjects and 25 covariates, with the first six being continuous and the rest binary.  The potential outcomes $Y(1)$ and $Y(0)$ are provided in the data. In each trial $t$, $Y(0) \sim \mathcal{N}(\bm{Z}\beta_t,1)$, $Y(1) \sim \mathcal{N}(\bm{Z}\beta_t+4,1)$, and $\beta_t$ is randomly chosen from values $(0, 1, 2, 3, 4)$ with probabilities $(0.5, 0.2, 0.15, 0.1,0.05)$. Thus, the potential outcomes vary across trials, while the covariates, CATE and ATE remain constant.

First, we treat both domains as RCTs by setting the propensity score model to $X\sim \operatorname{Bernoulli}(0.5)$ for all units. The observed outcome is then $Y = X Y(1) + (1-X) Y(0)$ by SUTVA. We randomly select 50 of the 1000 available trials, use each to create one training-test pair, and evaluate each model's generalizability on these pairs. To introduce domain shift, we keep all covariate values identical between the training and test domains except for $Z_1$, whose values in the test domain are 1.5 times those in the training domain (see \Cref{fig:ihdp_shift}). For each training-test pair, we learn the parameters following \Cref{alg:semisynthetic_data}, specifying a Gamma distribution for the marginal causal distribution. We denote the resulting data-generating distributions by $P_{\Theta^{A}}$ and $P_{\Theta^{B}}$ for the training and test domains, respectively. We sample $N^{A} = 1000$ training observations from $P_{\Theta^{A}}$ and $N^{B} = 200$ test observations from $P_{\Theta^{B}}$. The number of bootstraps is set to $N_{btp} = 200$.

\Cref{fig:ihdp_mean} shows boxplots of the $\operatorname{log}_{10}(p\text{-values})$ for each model, and \Cref{tab:ihdp_percentage} reports the percentage of $p$-values greater than 0.05 across the 50 trials. Among all methods, T- and S-engression demonstrate the best generalizability in this setting, with 24\% and 18\% of trials yielding $p > 0.05$, respectively. We also report the results of distributional regression testing in \Cref{fig:ihdp_dist}.


\begin{figure}[h]
\vspace{.3in}
\centerline{\includegraphics[width=0.8\linewidth]{ihdp_shift.png}}
\vspace{.3in}
\caption{Density of $Z_1$ of Training and Test Domains.}
\label{fig:ihdp_shift}
\end{figure}
\begin{figure}[t]
\vspace{.3in}
\centerline{\includegraphics[width=0.8\linewidth]{ihdp_mean_log.png}}
\vspace{.3in}
\caption{$\operatorname{log}_{10}(p\text{-values})$ of Mean Regression Testing of 50 Trials in IHDP.}
\label{fig:ihdp_mean}
\end{figure}

\begin{table}[h]
\caption{Percentage of $p > 0.05$, across 50 Trials.} 
\label{tab:ihdp_percentage}
\begin{center}
\begin{tabular}{rrr}
\hline
\textbf{Model} & \textbf{RCT} & \textbf{Non-RCT} \\
\hline
TARNet & 0\% & 0\% \\

CausalForest & 12\% & 6\%\\

S-BART & 12\% & 8\% \\

T-BART & 12\% & 6\% \\

S-engression & 18\% & 6\%\\
T-engression & 24\% & 8\%\\
\hline
\end{tabular}
\end{center}
\end{table}


\begin{figure}[t]
\vspace{.3in}
\centerline{\includegraphics[width=0.8\linewidth]{ihdp_dist_new_edit.png}}
\vspace{.3in}
\caption{$p$-values of Distributional Regression Testing of 50 Trials in IHDP.}
\label{fig:ihdp_dist}
\end{figure}

% We cover two simulation scenarios: Randomized Controlled Trials (RCT) and covariate imbalances across treatment arms by introducing propensity score models. 

While we use the RCT setting above to demonstrate our method, it is also applicable to observational studies. In a non-randomized setting where the treatment arms are imbalanced, obtained by setting $\operatorname{logit} P(X=1 \mid \bm{Z}) = Z_2+Z_3+Z_4$, the percentage of trials with $p>0.05$ for each algorithm is shown in the Non-RCT column of \Cref{tab:ihdp_percentage}. Since our focus is on providing a systematic method for evaluating generalizability, we omit further analysis here.
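
Written out on the probability scale, and reading the propensity model above as a logistic model (this reading is an assumption about the specification), the treatment probability is
\begin{equation*}
    P(X=1 \mid \bm{Z}) = \frac{1}{1+\exp\{-(Z_2+Z_3+Z_4)\}},
\end{equation*}
so that, for example, a unit with $Z_2+Z_3+Z_4=0$ is treated with probability $0.5$, and a unit with $Z_2+Z_3+Z_4=1$ with probability $1/(1+e^{-1}) \approx 0.73$. Treatment assignment therefore depends on the covariates, unlike in the RCT setting.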
% \begin{figure}[t]
% \vspace{.3in}
% \centerline{\includegraphics[width=0.8\linewidth]{ihdp_mean_obs.png}}
% \vspace{.3in}
% \caption{$p$-values of Mean Regression Testing across 50 Iterations, Non-randomized Study.}
% \label{fig:ihdp_mean_obs}
% \end{figure}

Details on hyperparameters and additional experiments, including performance comparisons with and without domain shift when the CATE is known to be linear, are provided in the Supplementary Material. Although our approach is designed mainly for evaluation, we also provide additional experiments on its ability to handle complicated dependency structures and model misspecification.

\section{SUMMARY}

In this paper, we develop a statistical method for evaluating the generalizability of causal inference algorithms using actual application data, facilitated by frugal parameterization. Our approach introduces a semi-synthetic simulation framework that bridges the gap between synthetic simulations and real-world applications, supporting the generalizability evaluation of both mean and distributional regression models. Through flexible, user-defined data generation processes, our framework provides robust statistical testing to assess how well models trained in one domain generalize to shifted domains. 

Through experiments on the synthetic and IHDP datasets, we assess the generalizability of algorithms such as TARNet, CausalForest, S-/T-BART, and S-/T-engression under domain shift. Our method acts as a valuable diagnostic tool, allowing us to explore how factors like training set size or covariate shifts impact generalizability. These insights can help identify model strengths and weaknesses and inform how causal inference models adapt to different settings.

In \Cref{sec:experiment}, we experimented only with Gaussian copulas with a fully connected dependency structure on a relatively small number of covariates. However, our framework can be extended to high-dimensional covariate settings and more complex dependency structures. For example, pair-copula constructions allow for flexible modeling of non-Gaussian copulas with complex dependency structures. Further details and experiments for each of these cases can be found in the Supplementary Material.


In our approach, rejecting the null hypothesis shows that a model does not generalize, but it does not quantify the extent of the failure. One possible extension is a more flexible testing method inspired by equivalence testing \citep{wellek2002testing}. This would assess not just whether a model fails but also by how much, by testing whether its performance is significantly worse than a given threshold, offering a more nuanced view than traditional hypothesis testing. In this paper, we consider only marginal causal quantities as the validation references, but thanks to the flexibility of the frugal parameterization, our framework can easily be adapted to use lower-dimensional CODs as the reference instead (see \Cref{subsec:frugal-params}).

We hope that this work inspires a more careful consideration of model evaluation, encourages simulations that better reflect real-world conditions, and highlights the importance of stress testing in advancing causal inference methodologies.

\clearpage
% \newpage



\subsubsection*{Acknowledgements}
The authors would like to thank Laura Battaglia and Xing Liu for their helpful comments on the paper.

D.d.V.M. is supported by a studentship from the UK EPSRC Doctoral Training Partnership (EP/T517811/1). L.Y. is supported by the EPSRC Centre for Doctoral Training in Modern Statistics and Statistical Machine Learning (EP/S023151/1) and Novartis.


% \bibliographystyle{plainnat}


\bibliography{references}
\clearpage
% \section*{Checklist}




%  \begin{enumerate}


%  \item For all models and algorithms presented, check if you include:
%  \begin{enumerate}
%    \item A clear description of the mathematical setting, assumptions, algorithm, and/or model. [Yes] \textit{We do our utmost to make this clear in our submission.}
%    \item An analysis of the properties and complexity (time, space, sample size) of any algorithm. [Yes] \textit{We are explicit about the sample sizes used in the paper, and have no inference algorithms as such to report.}
%    \item (Optional) Anonymized source code, with specification of all dependencies, including external libraries. [Yes] \textit{We attach a requirements file to our submitted code.}
%  \end{enumerate}


%  \item For any theoretical claim, check if you include:
%  \begin{enumerate}
%    \item Statements of the full set of assumptions of all theoretical results. [Yes] \textit{We make this clear in either the main body or the Supplementary Material.}
%    \item Complete proofs of all theoretical results. [Yes] \textit{Relevant proofs are either referenced or added to the Supplementary Material.}
%    \item Clear explanations of any assumptions. [Yes] \textit{We tried our best to make them clear.}
%  \end{enumerate}


%  \item For all figures and tables that present empirical results, check if you include:
%  \begin{enumerate}
%    \item The code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL). [Yes] \textit{All relevant code is included in our attached code. All external data we use is cited.}
%    \item All the training details (e.g., data splits, hyperparameters, how they were chosen). [Yes] \textit{We discuss our fitting process in the Supplementary Materials.}
%          \item A clear definition of the specific measure or statistics and error bars (e.g., with respect to the random seed after running experiments multiple times). [Yes] \textit{Done.}
%          \item A description of the computing infrastructure used. (e.g., type of GPUs, internal cluster, or cloud provider). [Yes] \textit{We discuss computational requirements.}
%  \end{enumerate}

%  \item If you are using existing assets (e.g., code, data, models) or curating/releasing new assets, check if you include:
%  \begin{enumerate}
%    \item Citations of the creator If your work uses existing assets. [Yes/No/Not Applicable] \textit{Cited in Supplementary Material.}
%    \item The license information of the assets, if applicable. [Yes]
%    \item New assets either in the supplemental material or as a URL, if applicable. [Yes]
%    \item Information about consent from data providers/curators. [Yes]
%    \item Discussion of sensible content if applicable, e.g., personally identifiable information or offensive content. [Not Applicable] \textit{We don't use sensitive material.}
%  \end{enumerate}

%  \item If you used crowdsourcing or conducted research with human subjects, check if you include:
%  \begin{enumerate}
%    \item The full text of instructions given to participants and screenshots. [Not Applicable] \textit{No crowdsorucing or human subjects used.}
%    \item Descriptions of potential participant risks, with links to Institutional Review Board (IRB) approvals if applicable. [Not Applicable] \textit{No crowdsorucing or human subjects used.}
%    \item The estimated hourly wage paid to participants and the total amount spent on participant compensation. [Not Applicable] \textit{No crowdsorucing or human subjects used.}
%  \end{enumerate}

%  \end{enumerate}

% \bibliographystyle{apalike}

\onecolumn
\appendix
\aistatstitle{
Supplementary Materials}
\section{COPULA BACKGROUND}\label{app:copulas}
% \vspace{-1cm}

Copulas present a powerful tool to model joint dependencies independent of the univariate margins. This aligns well with the requirements of the Frugal Parameterisation, where dependencies need to be varied without altering specified margins (the most critical being the specified causal effect). Understanding the constraints and limitations of copula models ensures that causal models remain accurate and consistent with the intended parameterisation.

\subsection{SKLAR'S THEOREM}
Sklar's theorem \citep{sklar1959,czado2019analyzing} provides the fundamental foundation for copula modelling by providing a bridge between multivariate joint distributions and their univariate margins. It allows one to separate the marginal behaviour of each variable from their joint dependence structure, with the latter being the copula itself.

\begin{theorem}
For a $d$-variate distribution function $F_{1:d} \in \mathcal{F}(F_1,\ldots,F_d)$, with $j^{\text{th}}$ univariate margin $F_j$, the copula associated with $F_{1:d}$ is a distribution function $C : [0,1]^d \rightarrow[0,1]$ with uniform margins on $(0,1)$ that satisfies
\begin{equation*}
    F_{1:d}(\bm{y}) = C(F_1(y_1),\dots,F_{d}(y_d)), \quad \bm{y} \in \mathbb{R}^{d}.
\end{equation*}
\begin{enumerate}
    \item If $F_{1:d}$ is a continuous $d$-variate distribution function with univariate margins $F_1,\dots, F_d$ and quantile functions $F^{-1}_1,\dots, F^{-1}_d$, then
    \begin{equation*}
        C(\bm{u}) = F_{1:d}(F^{-1}_1(u_1),\dots,F^{-1}_d(u_d)), \quad \bm{u}\in[0,1]^d.
    \end{equation*}
    \item If $F_{1:d}$ is a $d$-variate distribution function of discrete random variables (more generally, partly continuous and partly discrete), then the copula is unique only on the set
    \begin{equation*}
        \operatorname{Range}(F_1) \times \dots \times \operatorname{Range}(F_d).
    \end{equation*}
\end{enumerate}
If $F_{1:d}$ admits a density $f$, it is related to the copula density $c(\cdot)$ by
\begin{equation*}
    f(\bm{y}) = c(F_1(y_1),\dots, F_d(y_d))\cdot f_1(y_1)\cdots f_d(y_d),
\end{equation*}
where $f_i(\cdot)$ is the univariate density function of the $i^{\text{th}}$ variable.
\end{theorem}

Note that Sklar's theorem explicitly refers to the \textbf{univariate margins} of the variable set $\{Y_1,\dots, Y_d\}$ to convert between the copula $C(\bm{u})$, i.e.\ the joint distribution of the probability-integral-transformed variables, and the original distribution $F_{1:d}(\bm{y})$. For absolutely continuous random variables, the copula function $C$ is unique. This uniqueness no longer holds for discrete variables, but this does not severely limit the applicability of copulas to simulating from discrete distributions.
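
As an example relevant to our experiments, the Gaussian copula with correlation matrix $\bm{R}$ is obtained by applying the first case of the theorem to a multivariate normal distribution. The symbols $\Phi_{\bm{R}}$ and $\bm{q}$ below are introduced here purely for illustration:
\begin{align*}
    C_{\bm{R}}(\bm{u}) &= \Phi_{\bm{R}}\left(\Phi^{-1}(u_1),\dots,\Phi^{-1}(u_d)\right), \\
    c_{\bm{R}}(\bm{u}) &= |\bm{R}|^{-1/2} \exp\left\{-\tfrac{1}{2}\, \bm{q}^{\top}\left(\bm{R}^{-1} - \bm{I}_d\right)\bm{q}\right\}, \qquad \bm{q} = \left(\Phi^{-1}(u_1),\dots,\Phi^{-1}(u_d)\right)^{\top},
\end{align*}
where $\Phi$ is the standard normal distribution function and $\Phi_{\bm{R}}$ is the distribution function of $\mathcal{N}(\bm{0}, \bm{R})$. Setting $\bm{R} = \bm{I}_d$ gives $c_{\bm{R}} \equiv 1$, the independence copula.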

Equivalently, from an analytical point of view, a function $C: [0, 1]^d \rightarrow [0, 1]$ is a $d$-dimensional copula if it has the following properties:
\begin{enumerate}
    \item $C(u_1,\dots, 0, \dots, u_d) = 0$, i.e.\ $C$ vanishes whenever at least one argument is zero.
    \item $C(1, \dots, 1, u_i, 1, \dots, 1) = u_i$, i.e.\ each univariate margin is uniform.
    \item $C$ is $d$-non-decreasing.
\end{enumerate}
\begin{definition}
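    A function $C$ is $d$-non-decreasing if, for any hyperrectangle $H=\prod_{i=1}^{d}\left[a_i, b_i \right]\subseteq [0,1]^{d}$, the $C$-volume of $H$ is non-negative:
    \begin{equation*}
        V_C(H) = \int_{H}\mathrm{d}C(\bm{u}) = \sum_{\bm{v} \in \prod_{i=1}^{d}\{a_i, b_i\}} (-1)^{N(\bm{v})}\, C(\bm{v}) \geq 0,
    \end{equation*}
    where $N(\bm{v}) = \#\{i : v_i = a_i\}$ counts the coordinates of the vertex $\bm{v}$ that sit at a lower endpoint.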
\end{definition}
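
As a small illustration of this definition (our own example, not taken from the cited references), take $d=2$ and $H = [a_1,b_1]\times[a_2,b_2]$. The $C$-volume is the probability that a pair $(U_1,U_2)\sim C$ falls in $H$, obtained by inclusion-exclusion over the four vertices of $H$:
\begin{equation*}
    V_C(H) = C(b_1,b_2) - C(a_1,b_2) - C(b_1,a_2) + C(a_1,a_2).
\end{equation*}
For the independence copula $C(u_1,u_2)=u_1u_2$ this reduces to $(b_1-a_1)(b_2-a_2)\geq 0$, the area of $H$.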

\pagebreak  %%%%%%%%%%%%%%%%%% NEED TO KEEP THIS IN ORDER FOR SUPP MATERIAL TO BE RENDERED WELL.

\subsection{COPULAS FOR DISCRETE VARIABLES}\label{appsub:discrete-copulas}

\subsubsection{CHALLENGES AND MOTIVATIONS}\label{subsubsec:discrete-copula}
Modelling dependence in discrete and mixed data is particularly challenging, as copulas for discrete variables are not unique. Additionally, copulas encode a degree of ordering in the joint distribution, since probability integral transforms are inherently rank-based; they should therefore only be used for count or ordinal data. We use the approach suggested by \citet{ruschendorf2009distributional}. An outline of this method is presented in \Cref{appsub:distribtional-transform}.

\subsubsection{EMPIRICAL COPULA PROCESSES FOR DISCRETE VARIABLES}\label{appsub:distribtional-transform}
To deal with discrete variables, we use the generalised distributional transform of a random variable, originally proposed by \citet{ruschendorf2009distributional}. We quote the main result from \citet{ruschendorf2009distributional} below.

\begin{theorem}
On a probability space $(\Omega, \mathcal{A}, P)$ let $X$ be a real random variable with distribution function $F$ and let $V \sim U(0, 1)$ be uniformly distributed on $(0, 1)$ and independent of $X$. The \textit{modified distribution function} $F(x, \lambda)$ is defined by
\begin{equation*}
F(x, \lambda) := P(X < x) + \lambda P(X = x).
\end{equation*}
We define the (generalised) \textit{distributional transform} of $X$ by
\begin{equation*}
U := F(X, V).
\end{equation*}
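An equivalent representation of the distributional transform is
\begin{equation*}
U = F(X-) + V(F(X) - F(X-)),
\end{equation*}
where $F(x-)$ denotes the left-hand limit of $F$ at $x$. Then $U \sim U(0,1)$ and $X = F^{-1}(U)$ almost surely.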
\end{theorem}
\citet{ruschendorf2009distributional} makes a key remark about the generalised transform's lack of uniqueness for discrete variables. Such a dequantisation step may introduce artificial local dependence which may lead to an incorrect flow being inferred, and therefore hinder the inference of the causal margin.
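
To illustrate the distributional transform on the simplest discrete case (a worked example of our own, assuming $0<p<1$), let $X \sim \operatorname{Bernoulli}(p)$, so that $F(0-) = 0$, $F(0) = F(1-) = 1-p$ and $F(1) = 1$. The transform then spreads each atom of $X$ uniformly over the corresponding jump of $F$:
\begin{equation*}
    U = \begin{cases} (1-p)\,V, & X = 0, \\ (1-p) + p\,V, & X = 1. \end{cases}
\end{equation*}
For any $u \in [0,1]$,
\begin{equation*}
    P(U \leq u) = (1-p)\,\min\left\{1, \frac{u}{1-p}\right\} + p\,\max\left\{0, \frac{u-(1-p)}{p}\right\} = u,
\end{equation*}
so $U$ is uniform on $(0,1)$ even though $X$ is discrete.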

\subsection{PAIR COPULA CONSTRUCTIONS AND VINE COPULAS}\label{app:vinecop}
Pair copula constructions (PCCs) provide a flexible framework for modelling multivariate dependence by decomposing a high-dimensional copula into a sequence of bivariate copulas~\citep{bedford2002}. A vine copula is a specific class of PCC that uses a graphical model to arrange these bivariate copulas in a hierarchical structure, extending traditional copulas to describe complex conditional dependence structures in high-dimensional data while preserving computational tractability. This flexibility makes vine copulas particularly useful for multivariate distributions in which different pairwise interaction types and conditional dependencies must be specified~\citep{czado2022vine,czado2019analyzing}.

The hierarchical organisation of dependencies in vine copulas is achieved through a sequence of trees $\{T_1, T_2, \dots, T_{d-1}\}$. Each tree consists of nodes and edges; in the first tree these represent the variables and their dependencies, respectively. The first tree $T_1$ defines the unconditional pairwise dependencies between variables. Each subsequent tree $T_k$ defines dependencies conditional on the edges of the previous tree $T_{k-1}$. Each edge in $T_k$ is associated with a bivariate copula that models the conditional dependency between two variables. Mathematically, the density $c(u_1, \dots, u_d)$ of a vine copula over $d$ marginally uniform random variables can be expressed as:
\begin{equation}
c(u_1, \dots, u_d) = \prod_{k=1}^{d-1} \prod_{(i,j) \in E_k} c_{ij|D_{ij}}(u_i, u_j | u_{D_{ij}}),
\end{equation}
where $E_k$ represents the edges in the $k$-th tree, and $D_{ij}$ denotes the conditioning set for the pair $(i, j)$.
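
For concreteness, consider $d=3$ with first-tree edges $(1,2)$ and $(2,3)$. This structure is chosen here purely for illustration, and the expression is written under the usual simplifying assumption that the conditional pair copula does not depend on the value of the conditioning variable. The decomposition then reads
\begin{equation*}
    c(u_1,u_2,u_3) = c_{12}(u_1,u_2)\; c_{23}(u_2,u_3)\; c_{13|2}\left(u_{1|2},\, u_{3|2}\right),
    \qquad
    u_{1|2} = \frac{\partial C_{12}(u_1,u_2)}{\partial u_2}, \quad
    u_{3|2} = \frac{\partial C_{23}(u_2,u_3)}{\partial u_2},
\end{equation*}
where $u_{1|2}$ and $u_{3|2}$ are the conditional distribution functions of $U_1$ and $U_3$ given $U_2 = u_2$. Here $E_1 = \{(1,2),(2,3)\}$ with empty conditioning sets, and $E_2 = \{(1,3)\}$ with $D_{13} = \{2\}$.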

The flexibility of vine copulas lies in the freedom to choose different bivariate families, specify tree structures, and control dependency strengths. Each pair of variables can be modelled using a specific bivariate copula family, such as the Gaussian, Clayton, Gumbel, or Frank copula. These families capture a variety of dependency types, including tail dependence and asymmetry. The choice of tree structure determines the order in which dependencies are modelled. The parameters of the bivariate copulas can be adjusted to represent varying levels of correlation or dependency, and are either estimated from observed data or set according to predefined assumptions.

The primary advantage of vine copulas is their ability to model complex dependency structures while preserving computational tractability. By decomposing a high-dimensional copula into a cascade of lower-dimensional components, vine copulas facilitate efficient sampling, parameter estimation, and inference. In our experimental framework, we leverage these properties to evaluate the impact of different dependency structures on causal inference generalisability.

% \subsection{EXPERIMENTAL FRAMEWORK AND SETUP}

% \subsubsection{NON-LINEAR CAUSAL MARGINS}\label{subsubapp:non-linear}
% The combination of parameterising both more complex vine copula models and marginal covariate densities allows users of our method to test algorithm generalisation to cases where dependencies are non-linear. we parameterised the causal margin using Gamma distributions. Specifically, the outcome model was defined as:
% \begin{equation*}
% Y \mid \operatorname{do}(X) \sim \text{Gamma}(k=0.5x + 0.1,~\theta=1).
% \end{equation*}
% Additionally, the univariate marginal densities of the covariates in the were all set to $\text{Gamma}(k=2,~\theta=1)$. The ability of models to generalise was evaluated by computing p-values and comparing linear regression models against flexible machine learning approaches.

% The results indicated that linear models consistently failed to generalise under nonlinear causal margins, as evidenced by uniformly low p-values. This highlights the importance of incorporating nonlinearity in practical applications where the true causal relationship may be highly complex.

% \subsubsection{SENSITIVITY TO COPULA MISSPECIFICATION}\label{subsubapp:cop-misspec}
% In this section we evaluate the robustness of our semi-synthetic framework to copula misspecification. Our primary goal is to assess whether the framework can distinguish between datasets generated with different copula structures and whether models trained under incorrect dependency assumptions demonstrate measurable differences in generalisability. 

% To this end, we simulate data under two distinct copula specifications. In both cases, we simulated data using a randomly sampled R-vine structure with five covariates. The first data generating process parameterised each bivariate as a Clayton copula with a parameter of 2. Clayton copulas have an asymmetric tail dependency, which a Gaussian copula cannot capture well. To test misspecification, we fit a Gaussian copula with correlations matched to the “true” Clayton copula.

% The results (TO BE ADDED) reveal differing test outcomes between the datasets, despite the only difference being the choice of copula family. These experiments serve two purposes. Firstly, they  show that a variety of different copula families can be chosen and fitted to data. Secondly, we demonstrate our framework allows for nuanced sensitivity analyses across subtly different dependencies, which is particularly important if the underlying data is heavy tailed and not appropriately modelled by a Gaussian copula.


% \section{GAUSSIAN COPULA WITH GAUSSIAN MARGINS}
% \label{sec:gaussian}
% In the main text, we generate synthetic data from a Gaussian copula with univariate Gaussian margins. The resultant joint multivariate density is a multivariate Gaussian. Consequently, any univariate density conditioned on all the other variables will be Gaussian, and the conditional mean is a linear function of the conditioning variables. The proof for the latter can be found in \citet{bishop2006pattern}. The proof for the former is provided below.

% \begin{theorem}
%     Let $\{Y_d\}_{d=1}^{D}$ be a set of $D$ univariate Gaussian random variables, where each $Y_d \sim \mathcal{N}(\mu_d, \sigma_d^2)$. Let $c(F_1(y_1), \dots, F_D(y_D))$ denote a multivariate Gaussian copula parameterized by a correlation matrix $\bm{R}$. The joint distribution of the random vector $\bm{Y} = (Y_1, Y_2, \dots, Y_D)^{T}$ is multivariate normal, specifically:
%     \[
%         \bm{Y} \sim \mathcal{N}(\bm{\mu}, \Sigma \bm{R} \Sigma),
%     \]
%     where $\bm{\mu} = (\mu_1, \mu_2, \dots, \mu_D)^{T}$ is the mean vector, and $\Sigma$ is a $D \times D$ diagonal matrix with $\Sigma_{ii} = \sigma_i$ for $i = 1, \dots, D$, and $\Sigma_{ij} = 0$ for $i \neq j$.
% \end{theorem}

% \begin{proof}
%     Consider a Gaussian copula with univariate Gaussian marginals $\{Y_1, \dots, Y_D\}$, where each $Y_d \sim \mathcal{N}(\mu_d, \sigma_d^2)$. Let the copula distribution function $C(\bm{u})$ be given by
%     \[
%     C(\bm{u}) = \Phi_D(\Phi^{-1}(u_1), \dots, \Phi^{-1}(u_D) \mid \bm{0}, \bm{R}),
%     \]
%     where $\Phi_D(\cdot \mid \bm{0}, \bm{R})$ is the CDF of the $D$-dimensional standard normal distribution with correlation matrix $\bm{R}$, and $\Phi(\cdot)$ is the CDF of the standard normal distribution. The corresponding density function is:
%     \[
%     f(\bm{y}) = c(F_1(y_1), \dots, F_D(y_D)) \prod_{d=1}^{D} f_d(y_d),
%     \]
%     where $f_d(y_d)$ is the density of $Y_d \sim \mathcal{N}(\mu_d, \sigma_d^2)$ and $c(F_1(y_1), \dots, F_D(y_D))$ is the copula density. To compute the copula density, we differentiate $C(\bm{u})$ with respect to $u_1, \dots, u_D$:
%     \[
%     c(\bm{u}) = \frac{\partial C(\bm{u})}{\partial u_1 \dots \partial u_D}.
%     \]
%     Using the Gaussian copula formula, we obtain:
%     \[
%     c(\bm{u}) = \frac{\phi_D(\Phi^{-1}(u_1), \dots, \Phi^{-1}(u_D) \mid \bm{0}, \bm{R})}{\prod_{d=1}^{D} \phi(\Phi^{-1}(u_d))},
%     \]
%     where $\phi_D(\cdot \mid \bm{0}, \bm{R})$ is the PDF of the multivariate normal distribution with mean zero and correlation matrix $\bm{R}$, and $\phi(\cdot)$ is the standard univariate normal PDF.

%     Next, recall that for any Gaussian random variable $Y_d \sim \mathcal{N}(\mu_d, \sigma_d^2)$, we have:
%     \[
%     u_d = F_d(y_d) = \Phi\left( \frac{y_d - \mu_d}{\sigma_d} \right).
%     \]
%     By Lemma \ref{lemma:gaussian-cdf} (below), the inverse CDF of the standard normal, $\Phi^{-1}(u_d)$, satisfies:
%     \[
%     \Phi^{-1}(F_d(y_d)) = \frac{y_d - \mu_d}{\sigma_d}.
%     \]
%     Therefore, substituting into the copula density, we get:
%     \[
%     c(F_1(y_1), \dots, F_D(y_D)) = \frac{\phi_D\left( \frac{y_1 - \mu_1}{\sigma_1}, \dots, \frac{y_D - \mu_D}{\sigma_D} \mid \bm{0}, \bm{R} \right)}{\prod_{d=1}^{D} \frac{1}{\sigma_d} \phi\left( \frac{y_d - \mu_d}{\sigma_d} \right)}.
%     \]

%     Now, combining this with the marginal densities, we obtain the joint density:
%     \[
%     f(\bm{y}) = \phi_D\left( \frac{y_1 - \mu_1}{\sigma_1}, \dots, \frac{y_D - \mu_D}{\sigma_D} \mid \bm{0}, \bm{R} \right) \prod_{d=1}^{D} \frac{1}{\sigma_d}.
%     \]
%     Finally, multiplying by the product of the univariate densities $f_d(y_d)$ gives:
%     \[
%     f(\bm{y}) = \phi_D(\bm{y} \mid \bm{\mu}, \Sigma \bm{R} \Sigma),
%     \]
%     which is the PDF of a multivariate Gaussian distribution with mean vector $\bm{\mu}$ and covariance matrix $\Sigma \bm{R} \Sigma$. Hence, the joint distribution of $\bm{Y}$ is multivariate normal, as desired.
% \end{proof}

% \begin{lemma}\label{lemma:gaussian-cdf}
%     Let $\Phi^{-1}(\cdot)$ denote the inverse CDF of the standard normal distribution. For a Gaussian random variable $X \sim \mathcal{N}(\mu, \sigma^2)$, we have:
%     \[
%     \Phi^{-1}(F_X(x)) = \frac{x - \mu}{\sigma}.
%     \]
% \end{lemma}

% \begin{proof}
%     This follows by noting that $F_X(x) = \Phi\left( \frac{x - \mu}{\sigma} \right)$, and thus:
%     \[
%     \Phi^{-1}(F_X(x)) = \Phi^{-1}\left( \Phi\left( \frac{x - \mu}{\sigma} \right) \right) = \frac{x - \mu}{\sigma}.
%     \]
% \end{proof}


% \section{DERIVING UNIFORMLY MARGINAL RANKS USING A GAUSSIAN COPULA}

% In this section we outline the circumstances by where two different sets of marginal covariate distributions may yield the same marginal causal densities when assuming that $\hat{c}_{\YIZbdX}$ is a conditional copula density derived from a Gaussian copula. First and foremost, we want to emphasize that this is a rather strict scenario, and it is less likely to occur in real-world settings.

% This assumes that the ranks of the marginal causal model are distributed as follows:
% \begin{equation}\label{eq:cond-gauss-cop}
%     \Phi^{-1}(u_{\YIdX}) \mid \Phi^{-1}(u_{Z_1}), \dots, \Phi^{-1}(u_{Z_D}) \sim \mathcal{N}\left( \sum_{d=1}^{D} \beta_{d}\Phi^{-1}(u_{Z_d}),~1 - \sum_{d=1}^{D}\beta_{d}^{2}\right),
% \end{equation}
% which assures that the marginal distribution of $\Phi^{-1}(u_{\YIdX}) \sim \mathcal{N}(0,1)$ if $\{\Phi^{-1}(u_{Z})\}_{i}$. Given \Cref{eq:cond-gauss-cop}, our question is whether there is another set of conditioning variables which yields the same marginal outcome of the conditional model.

% We can rewrite \Cref{eq:cond-gauss-cop} as a linear combination of Gaussians:
% \begin{align}
%     \Phi^{-1}(u_{\YIdX}) = \sum_{d=1}^{D} \beta_{d} T_{d} + \epsilon
% \end{align}
% where $\epsilon \sim \mathcal{N}\left(0, 1 - \sum_{d=1}^{D}\beta_{d}^{2}\right)$, and $\{T_{d}\}_{d=1}^D$ are an arbitrary set of conditioning variables. If the marginal distribution of $\Phi^{-1}(u_{\YIdX})$ is Gaussian, then $\{T_{d}\}_{d=1}^D$ must each be Gaussian (Gaussian closure under linear marginalisation).

% Our next question is finding which linear transformations of $\{T_{d}\}_{d=1}^D$ will yield a standard Gaussian distribution of $\Phi^{-1}(u_{\YIdX})$. Assume that $\{T_{d}\}_{d=1}^D$ yields a marginal distribution of $\Phi^{-1}(u_{\YIdX})$ which is standard Gaussian. Let us perform the change of variables transformation
% \begin{equation*}
%     W_{d} = \alpha_{d} T_{d} + \mu_{d}, ~\forall~d=\{1,\dots,D\}
% \end{equation*}
% where $\alpha_{d}$ and $\mu_{d}$ are all constants. Our goal is to identify a set of conditions for $\{(\alpha, \mu)\}_{d}$ whereby 
% \begin{equation*}
%     \mathbb{E}\left[\sum_{d=1}^{D} \beta_{d} W_{d} + \epsilon \right] = 0 \qquad \text{and} \qquad \mathbf{Var}\left[ \sum_{d=1}^{D} \beta_{d} W_{d} + \epsilon \right] = 1.
% \end{equation*}
% Starting with the expectation,
% \begin{align}
%     \mathbb{E}\left[\sum_{d=1}^{D} \beta_{d} W_{d} + \epsilon \right] &= \mathbb{E}\left[\sum_{d=1}^{D} \alpha_{d} \beta_{d} T_{d} + \sum_{d=1}^{D} \beta_{d} \mu_{d} \right] \\
%     &= \sum_{d=1}^{D} \beta_{d} \mu_{d}.
% \end{align}
% Similarly for the variance,
% \begin{align}
%     \mathbf{Var}\left[\sum_{d=1}^{D} \beta_{d} W_{d} + \epsilon \right] &= \mathbf{Var}\left[\sum_{d=1}^{D} \alpha_{d} \beta_{d} T_{d} + \sum_{d=1}^{D} \beta_{d} \mu_{d} \right] + \mathbf{Var}[\epsilon] \\
%     &= \sum_{d=1}^{D} (\alpha_{d}\beta_{d})^{2} + 1 - \sum_{d=1}^{D}\beta_{d}^{2}.
% \end{align}
% The set of variables by which we can exactly sample from the same marginal effect are if
% \begin{equation*}
%     W_{d} \sim \mathcal{N}(\mu_{d}, \alpha_{d}^2 )
% \end{equation*}
% for any $\{(\mu_{d}, \alpha_{d})\}, d\in [1,\ldots, D]$ if 
% \begin{equation*}
%     \sum_{d=1}^{D} \beta_{d}\mu_{d} = 0 \quad \text{and} \quad \sum_{d=1}^{D}(\alpha_{d}\beta_{d})^{2} = \sum_{d=1}^{D} \alpha_{d}^{2}.
% \end{equation*}
% This is indeed an extreme case. Given how rarely these conditions are satisfied, especially in high-dimensional settings where the copula function can become quite complex, it is not a significant concern for our work.


\section{MODELS}

We provide details of the models evaluated in our paper.

\paragraph{Engression} Engression, proposed in \cite{shen2023engression}, approximates the conditional distribution $P\left(Y\mid X\right)$ using a pre-additive noise model $Y = g(WX + \eta) + \beta^\top X$, where $g: \mathbb{R}^d \rightarrow \mathbb{R}$ is a non-linear function and $\eta = h(\epsilon)$ is a flexible noise term. The model is parameterized by a neural network and trained by minimizing the energy score loss for distributional regression.
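
As a sketch, the energy score loss of a generative model can be written as follows, where $\hat{Y}$ and $\hat{Y}'$ denote two independent draws from the fitted conditional distribution given $X$ (this notation is introduced here for illustration only):
\begin{equation*}
    \mathcal{L} = \mathbb{E}\left[\lVert Y - \hat{Y} \rVert\right] - \frac{1}{2}\, \mathbb{E}\left[\lVert \hat{Y} - \hat{Y}' \rVert\right].
\end{equation*}
The first term rewards draws that are close to the observed outcome, while the second term rewards spread between draws, so that the loss is minimized when the fitted distribution matches $P\left(Y\mid X\right)$.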

\paragraph{Meta-learners}
Meta-learners are flexible frameworks in causal inference designed to estimate individualized treatment effects by leveraging machine learning models. Two common types are T-learners and S-learners. Details can be found in \cite{kunzel2019metalearners}.

T-learners work by training separate models for the treated and untreated groups, predicting outcomes under each treatment condition, and then calculating the difference between these predictions to estimate the treatment effect.
S-learners combine both treated and untreated data into a single model by including treatment as an input feature, allowing the model to learn the outcome function across both treatment conditions simultaneously.
These learners provide a modular approach to estimating Conditional Average Treatment Effects (CATE) and can adapt to different settings and model complexities.
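
In the notation of this paper (treatment $X$, covariates $\bm{Z}$, outcome $Y$), the two estimators can be sketched as follows. The symbols $\hat{\mu}_{0}$, $\hat{\mu}_{1}$ and $\hat{\mu}$ are hypothetical names for the fitted base regressions and are not taken from \cite{kunzel2019metalearners}:
\begin{align*}
    \hat{\tau}_{T}(\bm{z}) &= \hat{\mu}_{1}(\bm{z}) - \hat{\mu}_{0}(\bm{z}), \\
    \hat{\tau}_{S}(\bm{z}) &= \hat{\mu}(\bm{z}, 1) - \hat{\mu}(\bm{z}, 0).
\end{align*}
Here $\hat{\mu}_{x}$ is fitted only on units with treatment $X = x$, whereas $\hat{\mu}$ is fitted on all units with $X$ as an additional input; both target $\mathbb{E}[Y \mid \bm{Z} = \bm{z}, X = x]$.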
\paragraph{CausalForest}

CausalForest is an extension of random forests designed to estimate heterogeneous treatment effects by partitioning the data into subgroups with similar treatment responses. Introduced by \cite{wager2018estimation}, CausalForest uses a tree-based ensemble method to non-parametrically estimate Conditional Average Treatment Effects (CATE) by building separate models for different covariate regions, while ensuring a balance between treated and control units in each partition. This method is flexible and adapts to complex data structures, making it a powerful tool for understanding treatment effect heterogeneity.

\paragraph{BART} BART (Bayesian Additive Regression Trees), first introduced in  \cite{chipman2010bart}, is a non-parametric machine learning method that uses an ensemble of regression trees to model complex relationships between covariates and outcomes.  The BART model estimates the posterior distribution of the outcome by summing the contributions from many trees, each of which is trained to explain part of the residual error left by the others. This ensemble approach makes BART particularly effective at capturing complex, non-linear relationships between the covariates and the outcome. Unlike standard decision trees, BART applies a Bayesian framework, allowing it to quantify uncertainty in its predictions and avoid overfitting through regularization priors.

\paragraph{TARNet} TARNet (Treatment-Agnostic Representation Network), first introduced in \cite{johansson2016learning}, is a neural network-based model for estimating heterogeneous treatment effects in causal inference. It works by learning a shared representation of covariates, independent of treatment assignment, and then using this representation to estimate potential outcomes for both the treated and untreated groups. By focusing on treatment-agnostic representation learning, TARNet aims to improve the generalizability and accuracy of treatment effect estimates, particularly in high-dimensional settings.

\section{COMPUTATION DETAILS}
We provide computation details for the experiments reported in the Experiment section. We use the default recommended hyperparameters for each model, summarized in \Cref{tab:hyperparameter}.

\begin{table}[h]
\caption{Hyperparameters of Each Model} 
\label{tab:hyperparameter}
\begin{center}
\begin{tabular}{|l|p{8cm}|p{5cm}|}
\hline
\textbf{Model} & \textbf{Key Hyperparameters} & \textbf{Package} \\
\hline
TARNet & Number of layers = 2, batch size = 64, learning rate = 0.0001, number of epochs = 2000 & Python, \texttt{catenets} \citep{curth2021really} \\
\hline
CausalForest & Number of trees = 100, maximum depth = 3 & Python, \texttt{econml} \citep{econml} \\
\hline
S-/T-BART & Number of trees = 75, number of iterations = 4, number of burn-in iterations = 200, posterior draws = 800 & R, \texttt{dbarts} \citep{dbarts} \\
\hline
S-/T-engression & Number of layers = 3, batch size = 64, learning rate = 0.01, number of epochs = 500 & Python, \texttt{engression} \citep{engression} \\
\hline
\end{tabular}
\end{center}
\end{table}

All experiments were conducted on a MacBook with an Apple M3 chip, 8-core CPU, and 32GB RAM. The code can be found in \texttt{TestGeneralizability.zip}.


\section{ADDITIONAL EXPERIMENTS}
\label{sec:additional_exp}
\subsection{TESTING GENERALIZABLE MODELS}
\label{sec:linear}
In this section, we include an additional experiment based on the synthetic data setting in the main text, but without domain shift. We set the marginal distributions of $Z_1$ and $Z_2$ to be $\mathcal{N}(1,1)$, with $Y(X) \sim \mathcal{N}(2X+1,1)$ and $X\sim \operatorname{Bernoulli}(0.5)$. In this case, the CATE should be linear.

Results for the case without domain shift are shown in \Cref{fig:synthetic_mean_p_noshift}. We see that the $p$-values of both S-LinearRegression and T-LinearRegression are uniformly distributed. Given that the true CATE function is indeed linear, this result validates our proposed method.


\begin{figure}[h!]
\vspace{.3in}
\centerline{\includegraphics[width=0.5\linewidth]{synthetic_mean_p_noshift.png}}
\vspace{.3in}
\caption{$p$-values of Mean Regression Testing, Synthetic Data of 50 Iterations, No Domain Shift.}
\label{fig:synthetic_mean_p_noshift}
\end{figure}

We next consider the case with domain shift: we keep all settings the same as above for the training set, but change the marginal distributions of $Z_1$ and $Z_2$ in the test set to $\mathcal{N}(3,2)$. \Cref{fig:synthetic_mean_p_shift} shows the results. Linear regressions still generalize well. However, for algorithms such as S-engression and S-BART the results worsen, likely due to problems such as overfitting.

\begin{figure}[h!]
\vspace{.3in}
\centerline{\includegraphics[width=0.5\linewidth]{synthetic_mean_p_shift_linear.png}}
\vspace{.3in}
\caption{$p$-values of Mean Regression Testing, Synthetic Data of 50 Iterations, with Domain Shift.}
\label{fig:synthetic_mean_p_shift}
\end{figure}

\subsection{MORE COMPLICATED DATA GENERATION}
To demonstrate the flexibility of our approach, we run additional experiments across different data generation settings, including increasing the number of covariates, changing the marginal distributions, and changing the dependency structures.

Below are $\log_{10}$ $p$-value statistics from 50 trials conducted under a setting similar to Synthetic Setting 1 in the main body of our paper, with only two changes: (1) we increase the number of covariates from 2 to 50 and 100, keeping the covariate distribution shift the same for each covariate; (2) we replace the dependency structures with randomly sampled correlation matrices.

\begin{table}[h!]
\centering

\begin{tabular}{lrrrrrr}
\hline
Model & Min & 25\% & Median & Mean & 75\% & Max \\
\hline
TARNet & $-36.8$ & $-33.6$ & $-32.4$ & $-31.6$ & $-31.5$ & $-30.8$ \\
CausalForest & $-2.35$ & $-1.22$ & $-0.851$ & $-0.539$ & $-0.214$ & $-0.077$ \\
S-BART & $-8.32$ & $-4.22$ & $-3.36$ & $-2.58$ & $-2.66$ & $-1.92$ \\
T-BART & $-1.52$ & $-0.757$ & $-0.326$ & $-0.349$ & $-0.187$ & $-0.044$ \\
S-engression & $-20.0$ & $-18.2$ & $-17.3$ & $-14.4$ & $-16.7$ & $-13.1$ \\
T-engression & $-2.53$ & $-0.669$ & $-0.283$ & $-0.324$ & $-0.211$ & $-0.006$ \\
\hline
\end{tabular}
\caption{$\log_{10}$ $p$-value statistics under Synthetic Setting 1 with 50 covariates.}
\end{table}

\begin{table}[h!]
\centering

\begin{tabular}{lrrrrrr}
\hline
Model & Min & 25\% & Median & Mean & 75\% & Max \\
\hline
TARNet & $-34.3$ & $-31.3$ & $-30.9$ & $-29.9$ & $-30.3$ & $-29.0$ \\
CausalForest & $-2.39$ & $-1.35$ & $-0.821$ & $-0.670$ & $-0.479$ & $-0.122$ \\
S-BART & $-9.62$ & $-7.32$ & $-6.68$ & $-5.22$ & $-5.99$ & $-3.96$ \\
T-BART & $-1.04$ & $-0.76$ & $-0.36$ & $-0.31$ & $-0.12$ & $0.00$ \\
S-engression & $-28.6$ & $-26.1$ & $-25.3$ & $-23.6$ & $-24.0$ & $-22.6$ \\
T-engression & $-2.46$ & $-0.663$ & $-0.393$ & $-0.366$ & $-0.137$ & $-0.107$ \\
\hline
\end{tabular}
\caption{$\log_{10}$ $p$-value statistics under Synthetic Setting 1 with 100 covariates.}
\end{table}
CausalForest, T-engression and T-BART demonstrate good generalizability in these settings.
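
Since the tables report $p$-values on the $\log_{10}$ scale, it may help to convert a few entries back to the original scale. As a worked example (values rounded to two significant figures):
\begin{align*}
    \log_{10}(0.05) &\approx -1.30, \\
    10^{-0.851} &\approx 0.14 \quad \text{(CausalForest median, 50 covariates)}, \\
    10^{-32.4} &\approx 4.0 \times 10^{-33} \quad \text{(TARNet median, 50 covariates)}.
\end{align*}
Entries above $-1.30$ therefore correspond to $p$-values above $0.05$. The $0.05$ level is used here only as an illustrative threshold and is not prescribed by the experiments.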

\Cref{tab:non-linear} presents the $\log_{10}$ $p$-value statistics from 50 trials under the same setup as the first experiment in \Cref{sec:linear}, except that the marginal causal distribution is altered. Changing this from Gaussian to Gamma introduces non-linear dependencies in the conditional causal margin. While linear regression was generalizable in the original setup, it fails in the non-linear setting, demonstrating that our approach can detect methods that fail to generalize well.

\begin{table}[h!]
\centering

\begin{tabular}{rrrrrrr} \hline Model & Min & 25\% & Median & Mean & 75\% & Max \\ \hline S-Linear & $-16.2$ & $-13.0$ & $-11.7$ & $-10.9$ & $-10.8$ & $-9.93$ \\ 
T-Linear & $-13.3$ & $-11.2$ & $-10.6$ & $-9.67$ & $-10.0$ & $-8.61$ \\ 
TARNet & $-24.0$ & $-21.7$ & $-21.2$ & $-19.5$ & $-19.9$ & $-18.6$ \\ 
CausalForest & $-12.2$ & $-10.8$ & $-10.2$ & $-9.16$ & $-9.34$ & $-8.23$ \\ 
S-BART & $-13.4$ & $-10.3$ & $-9.33$ & $-7.60$ & $-8.50$ & $-6.36$ \\ 
T-BART & $-11.6$ & $-8.40$ & $-7.84$ & $-6.38$ & $-7.49$ & $-5.12$ \\ 
S-engression & $-12.5$ & $-9.89$ & $-9.45$ & $-8.01$ & $-8.17$ & $-7.13$ \\ 
T-engression & $-9.69$ & $-7.32$ & $-6.83$ & $-5.23$ & $-6.32$ & $-3.99$ \\ \hline \end{tabular}
\caption{$\log_{10}$ $p$-value statistics under the same setup as the first experiment in \Cref{sec:linear}, but with non-linear dependency.}
\label{tab:pvalue_statistics}
\label{tab:non-linear}
\end{table}

A strength of our framework is that vine copulas allow users to test their methods against various classes of copulas. We demonstrate this in \Cref{tab:non-gaussian_copula} with the following data generating process:

\begin{itemize}
    \item Training Domain: Covariates' marginal distributions are identical Gamma distributions with shape $k=8$ and rate $\theta=4$;
    \item Testing Domain: Covariates' marginal distributions are identical Gamma distributions with shape $k=2$ and rate $\theta=1$;
    \item Marginal Causal Distribution: Modeled as an Exponential distribution with $k=0.5x+0.1$;
    \item Treatment Assignment: Specified as $X\sim \operatorname{Bernoulli}(0.5)$;
    \item Copula: Randomly sampled R-vine structure, with each bivariate copula set to be a Clayton copula~\citep{kreinovich2013clayton} with a parameter of 2.
\end{itemize}
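
To make the covariate shift in this setting concrete, the moments of the two Gamma margins can be worked out as follows. This sketch assumes the rate parameterization stated above, under which a Gamma distribution with shape $k$ and rate $\theta$ has mean $k/\theta$ and variance $k/\theta^{2}$; if $\theta$ were instead a scale parameter, the numbers would differ.
\begin{align*}
    \text{Training:} \quad & \frac{k}{\theta} = \frac{8}{4} = 2, & \frac{k}{\theta^{2}} &= \frac{8}{16} = 0.5, \\
    \text{Testing:} \quad & \frac{k}{\theta} = \frac{2}{1} = 2, & \frac{k}{\theta^{2}} &= \frac{2}{1} = 2.
\end{align*}
Under this reading, the shift leaves the covariate means unchanged and increases their variance from $0.5$ to $2$.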

\begin{table}[h!]
\centering
\begin{tabular}{lcccccc} \hline Model & Min & 25\% & Median & Mean & 75\% & Max \\ \hline S-Linear & $-\infty$ & $-5.71$ & $-4.92$ & $-3.89$ & $-4.15$ & $-2.81$ \\ T-Linear & $-\infty$ & $-3.47$ & $-2.64$ & $-1.87$ & $-1.94$ & $-0.929$ \\ TARNet & $-18.3$ & $-15.9$ & $-14.5$ & $-12.5$ & $-13.7$ & $-11.2$ \\ CausalForest & $-10.9$ & $-3.53$ & $-2.81$ & $-2.35$ & $-2.23$ & $-1.49$ \\ S-BART & $-\infty$ & $-4.12$ & $-3.47$ & $-2.91$ & $-2.99$ & $-2.04$ \\ T-BART & $-\infty$ & $-4.14$ & $-3.24$ & $-2.62$ & $-2.62$ & $-1.73$ \\ S-engression & $-15.8$ & $-4.06$ & $-3.04$ & $-2.35$ & $-2.63$ & $-1.54$ \\ T-engression & $-10.2$ & $-3.70$ & $-2.28$ & $-1.95$ & $-1.70$ & $-1.40$ \\ \hline \end{tabular}
\caption{$\log_{10}$ $p$-value statistics for the experiment with a non-Gaussian copula. Entries of $-\infty$ arise because the original $p$-values are 0.}
\label{tab:non-gaussian_copula}
\end{table}



\Cref{tab:gaussian_copula} shows the generalizability testing results for data generated from a Gaussian copula. The covariate margins, the causal margins, the dependency structure, and the second moments of each bivariate copula are identical to the previous example. We set the rank correlation coefficient of the Gaussian copula to $\rho = \frac{\theta}{2+\theta}$, where $\theta$ parameterizes the Clayton copula and was set to 2 in the previous example. The only difference between the two processes is the copula family.
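
The expression $\theta/(2+\theta)$ is Kendall's $\tau$ of a Clayton copula with parameter $\theta$. As a worked check, assuming the rank correlation above refers to Kendall's $\tau$:
\begin{equation*}
    \tau_{\text{Clayton}} = \frac{\theta}{\theta + 2} = \frac{2}{2 + 2} = 0.5.
\end{equation*}
For a bivariate Gaussian copula with correlation parameter $r$ (a symbol introduced here for illustration only), the standard relation $\tau = \frac{2}{\pi}\arcsin(r)$ would then give $r = \sin(\pi\tau/2) = \sin(\pi/4) \approx 0.707$.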

\begin{table}[h!]
\centering
\begin{tabular}{lcccccc} \hline Model & Min & 25\% & Median & Mean & 75\% & Max \\ \hline S-Linear & $-\infty$ & $-5.19$ & $-4.68$ & $-3.52$ & $-3.49$ & $-2.73$ \\ T-Linear & $-\infty$ & $-2.59$ & $-1.83$ & $-1.42$ & $-1.45$ & $-0.512$ \\ TARNet & $-21.3$ & $-15.6$ & $-14.3$ & $-13.5$ & $-13.6$ & $-12.7$ \\ CausalForest & $-5.45$ & $-3.81$ & $-2.94$ & $-1.70$ & $-2.01$ & $-0.517$ \\ S-BART & $-\infty$ & $-3.48$ & $-2.85$ & $-2.13$ & $-2.09$ & $-1.20$ \\ T-BART & $-\infty$ & $-2.85$ & $-2.17$ & $-1.74$ & $-1.44$ & $-1.07$ \\ S-engression & $-11.4$ & $-3.16$ & $-2.62$ & $-1.82$ & $-1.68$ & $-1.10$ \\ T-engression & $-9.24$ & $-2.61$ & $-1.51$ & $-1.08$ & $-0.921$ & $-0.322$ \\ \hline \end{tabular}
\caption{$\log_{10}$ $p$-value statistics for the experiment with the same setting as in \Cref{tab:non-gaussian_copula}, but with a Gaussian copula.}
\label{tab:gaussian_copula}
\end{table}

Contrasting \Cref{tab:non-gaussian_copula} and \Cref{tab:gaussian_copula} shows that model generalizability is sensitive to the copula family. Therefore, the flexibility to simulate data from different copula families, which is a key advantage of our framework, is important for evaluating model generalizability.



% The results in \Cref{tab:non-gaussian_copula} reveal differing test outcomes between the datasets, despite the only difference being the choice of copula family. These experiments serve two purposes. Firstly, they  show that a variety of different copula families can be chosen and fitted to data. Secondly, we demonstrate our framework allows for nuanced sensitivity analyses across subtly different dependencies, which is particularly important if the underlying data is heavy tailed and not appropriately modelled by a Gaussian copula.
% We illustrate the benefits of such flexibility by conducting a sensitivity analysis of the algorithms tested in this paper to different copula families. The marginal densities and the copula tree structure are kept the same. We simulated data using a randomly sampled R-vine structure for with five covariates. We set each bivariate copula to a Clayton copula with a parameter of 2, highlighting its asymmetric tail dependency, which a Gaussian copula cannot capture well~\citep{kreinovich2013clayton}.

\section{INTERPRETING TESTING RESULTS}
We further explain the motivation of our paper and provide guidance on reading the testing results.

All $p$-values, including their distributions, are highly informative in evaluating generalizability. For example, consistently small $p$-values (as shown in \Cref{fig:ihdp_mean}) indicate a clear failure of model generalizability in that setting. Conversely, uniformly distributed $p$-values (e.g., the linear regression results in \Cref{fig:synthetic_mean_p_noshift}) support greater trust in the model's generalizability. Type I error control plays a critical role in distinguishing between competing hypotheses with a minimal probability of error. In our framework, controlling the Type I error ensures that conclusions about non-generalization when a model fails the test are not driven by random noise. This rigor is crucial for causal inference, where decisions based on incorrect conclusions can have significant consequences. In contrast, predictive performance measures such as MSE lack statistical safeguards, so interpretations of model performance under domain shift based on them lack reliability and robustness.
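
To make the role of uniformity concrete, recall that if the null hypothesis holds and the test is exactly calibrated, the $p$-value $p$ satisfies
\begin{equation*}
    \mathbb{P}(p \leq \alpha) = \alpha \quad \text{for all } \alpha \in [0, 1].
\end{equation*}
As an illustration with 50 repetitions, as in our synthetic experiments, and a hypothetical level $\alpha = 0.05$, the expected number of rejections under the null is $50 \times 0.05 = 2.5$. A rejection count far above this value suggests that the model does not generalize in that setting.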


We also explain how to interpret the case in which all tests fail. As with any hypothesis test, failing to pass provides evidence against the tested hypothesis. In our framework, this means the algorithm lacks sufficient generalizability to infer the conditional treatment margin in new domains. If all algorithms fail, it signals that none are suitable for reliable causal inference under the domain shift.

This highlights the need for alternative modeling approaches and underscores the value of our framework. Unlike MSE, which compares predictive performance, our method directly identifies failures in causal generalizability, which is an essential insight for researchers. We hope this clarifies how to interpret such results and guides researchers in determining next steps when all models fail.
% \newpage
% \bibliography{references}
% \vfill

\end{document}
