\documentclass[twoside]{article}

\usepackage{aistats2025}
% algorithm
\usepackage{algorithm}
% \usepackage[noend]{algpseudocode}
\usepackage[noend]{algorithmic}
% \usepackage{algorithmic}
\usepackage{amsfonts}
\usepackage{amsmath,bm}
\usepackage{bbold}
\usepackage{graphicx}
\usepackage{cleveref}
\usepackage{amsthm}

\newtheorem{theorem}{Theorem}
\newtheorem{definition}{Definition}
\newtheorem{lemma}[theorem]{Lemma}

\newcommand{\RR}{I\!\!R} %real numbers
\newcommand{\Nat}{I\!\!N} %natural numbers
\newcommand{\CC}{I\!\!\!\!C} %complex numbers

\newcommand{\YZIX}{{Y\hspace{-1pt}Z \hspace{-.5pt} | \hspace{-.5pt}X}}
\newcommand{\ZYIX}{{Z\hspace{-1pt}Y \hspace{-.5pt} | \hspace{-.5pt}X}}
\newcommand{\ZbYIX}{{\bm{Z}\hspace{-.25pt}Y \hspace{-.5pt} | \hspace{-.5pt}X}}
\newcommand{\ZYIdX}{{Z\hspace{-1pt}Y \hspace{-.5pt} | \hspace{-.5pt}\text{do}(X)}}
\newcommand{\ZbYIdX}{{\bm{Z}\hspace{-.5pt}Y \hspace{-.5pt} | \hspace{-.5pt}\text{do}(X)}}
\newcommand{\ZbYx}{{\bm{Z}\hspace{-.5pt}Y(x) \hspace{-.5pt}}}
\newcommand{\YIZbdX}{{\hspace{-.5pt}Y \hspace{-.5pt} | \hspace{-.5pt} \bm{Z}\hspace{-1pt}, \hspace{-0.5pt} \text{do}(X)}}
\newcommand{\YxIZb}{{\hspace{-.5pt}Y(x) \hspace{-.5pt} | \hspace{-.5pt} \bm{Z}}}
\newcommand{\YIZdX}{{\hspace{-.5pt}Y \hspace{-.5pt} | \hspace{-.5pt} Z\hspace{-1pt}, \hspace{-0.5pt} \text{do}(X)}}
\newcommand{\YZIXC}{{Y\hspace{-1pt}Z \hspace{-.5pt} | \hspace{-.5pt}X\hspace{-.5pt}C}}
\newcommand{\YcolonZdX}{{Y\hspace{-1pt};\bm{Z} \hspace{-.5pt} | \hspace{-.5pt}\text{do}(X)}}
\newcommand{\YIXZ}{{Y \hspace{-.5pt} | \hspace{-.5pt}X \hspace{-1pt}Z}}
\newcommand{\YIZX}{{Y \hspace{-.5pt} | \hspace{-.5pt}Z \hspace{-1pt}X}}
\newcommand{\YIZbX}{{Y \hspace{-.5pt} | \hspace{-.5pt}\bm{Z} \hspace{-.5pt}X}}
\newcommand{\ZXY}{{\hspace{-.5pt}Z \hspace{-1pt} X \hspace{-.75pt} Y}}
\newcommand{\ZbXY}{{\hspace{-.5pt}\bm{Z} \hspace{-1pt} X \hspace{-.75pt} Y}}
\newcommand{\XY}{{X \hspace{-.75pt} Y}}
\newcommand{\ZX}{{\hspace{-.5pt} Z \hspace{-1pt} X}}
\newcommand{\YIZ}{Y \hspace{-.5pt} | \hspace{-.75pt} Z}
\newcommand{\YZ}{Y \hspace{-1pt} Z}
\newcommand{\YIX}{{Y\hspace{-.5pt}|\hspace{-.5pt} X}}
\newcommand{\ZIX}{{Z \hspace{-.5pt}|\hspace{-.5pt} X}}
\newcommand{\ZbIX}{{\bm{Z} \hspace{-.5pt}|\hspace{-.5pt} X}}
\newcommand{\XIZ}{{\hspace{-.75pt} X \hspace{-.5pt}|\hspace{-.5pt} Z}}
\newcommand{\XIZb}{{\hspace{-.75pt} X \hspace{-.5pt}|\hspace{-.5pt} \bm{Z}}}
\newcommand{\indep}{\rotatebox[origin=c]{90}{$\models$}}

\newcommand{\cmid}{\,|\,}


% If your paper is accepted, change the options for the package
% aistats2025 as follows:
%
%\usepackage[accepted]{aistats2025}
%
% This option will print headings for the title of your paper and
% headings for the authors names, plus a copyright note at the end of
% the first column of the first page.

% If you set papersize explicitly, activate the following three lines:
%\special{papersize = 8.5in, 11in}
%\setlength{\pdfpageheight}{11in}
%\setlength{\pdfpagewidth}{8.5in}

% If you use natbib package, activate the following three lines:
\usepackage[round]{natbib}
\renewcommand{\bibname}{References}
\renewcommand{\bibsection}{\subsubsection*{\bibname}}

% If you use BibTeX in apalike style, activate the following line:
\bibliographystyle{apalike}

\begin{document}

% If your paper is accepted and the title of your paper is very long,
% the style will print as headings an error message. Use the following
% command to supply a shorter title of your paper so that it can be
% used as headings.
%
%\runningtitle{I use this title instead because the last one was very long}

% If your paper is accepted and the number of authors is large, the
% style will print as headings an error message. Use the following
% command to supply a shorter version of the authors names so that
% they can be used as headings (for example, use only the surnames)
%
%\runningauthor{Surname 1, Surname 2, Surname 3, ...., Surname n}

\twocolumn[

\aistatstitle{Testing Generalizability in Causal Inference}

\aistatsauthor{ Author 1 \And Author 2 \And  Author 3 }

\aistatsaddress{ Institution 1 \And  Institution 2 \And Institution 3 } ]

\begin{abstract}
Ensuring robust model performance across diverse real-world scenarios requires addressing both transportability across domains with covariate shifts and extrapolation beyond observed data ranges. However, no formal procedure exists for statistically evaluating generalizability in machine learning algorithms. Existing methods often rely on arbitrary metrics like AUC or MSE and focus predominantly on toy datasets, providing limited insight into real-world applicability. To address this gap in the domain of causal inference, we propose a systematic and quantitative framework for evaluating model generalizability under covariate distribution shifts. Our approach uses the frugal parameterization, allowing for flexible simulations from fully and semi-synthetic benchmarks and offering comprehensive evaluations for both mean and distributional regression methods. By basing simulations on real data, our method provides more realistic evaluations than current work relying on simplified datasets. Furthermore, by combining simulation with statistical testing, our framework is robust and avoids over-reliance on conventional metrics, bridging the gap between synthetic evaluations and practical applications.
\end{abstract}

\section{INTRODUCTION}

Algorithm generalizability has garnered significant interest in fields such as computer vision and natural language processing. It encompasses both transportability under covariate shifts between domains and extrapolation, where predictions are made within the same population but beyond the observed data range or in underrepresented subgroups.

Generalizability has also become a central focus in causal inference \citep{bareinboim2016causal,curth2021really, johansson2018learning, buchanan2018generalizing,ling2022critical,bica2022transfer}. Here, it refers to the ability of a causal model to make accurate causal predictions or draw valid causal conclusions when applied to data from a domain or distribution other than the one it was trained on. This concept is crucial when the objective involves understanding and predicting the effects of interventions across various settings. These settings may significantly diverge from the original conditions under which the model was developed, presenting challenges due to variations in environment, demographics, or other external influences. This holds particular importance in clinical settings, where the growing interest in personalized treatment and patient stratification underscores the need for 
% methods that
% that allow for 
inferences to generalize across diverse population domains.

Although strategies for improving generalization have been widely explored \citep{zhou2022domain,yu2024survey}, there has been comparatively little focus on developing a comprehensive, structured framework for evaluating generalizability. A common approach is to measure generalization or extrapolation performance using metrics like AUC for classification or MSE for regression. However, these metrics are often uninformative: achieving an MSE of 5, compared with 10 for other methods, on synthetic data unrelated to the user's intended application provides no clear guarantee of real-world performance. Therefore, it is essential to establish a systematic, simulation-based framework for evaluating generalizability, which can offer a more robust and comprehensive understanding of how these methods perform on relevant tasks.

This paper proposes a method to statistically evaluate the generalizability of causal inference algorithms under covariate and treatment distribution shifts. We introduce a semi-synthetic simulation framework using two domains, training (A) and testing (B), that share the same intervened conditional outcome distributions but potentially differ in covariate and treatment distributions. A model is trained on domain A to \textbf{learn the shared high-dimensional conditional outcome distribution}. We test the model's generalizability by estimating the marginal causal quantities in domain B, where these values are explicitly known. This approach simplifies the evaluation by reducing the problem from a higher-dimensional intervened model to a lower-dimensional causal effect, enabling more powerful statistical testing.
% With our method, we are able to derive the explicit, known values of these marginal quantities. We thus assess the generalizability of the trained model by constructing estimations and true marginal quantities. This approach simplifies the evaluation from a higher-dimensional intervened model to lower-dimensional marginal quantities, facilitating statistical testing.

A high-level overview of our workflow is as follows:
\begin{enumerate}
    \item \textbf{Learn the distribution parameters of the two domains and the conditional outcome distribution (COD) from real-world data}: Define two domains, domain A and domain B, whose covariate and treatment distributions differ but whose COD is the same. These distributions can be learned empirically from real-world data, rather than being restricted to pre-specified parametric models.
    
    \item \textbf{Model training}: Simulate semi-synthetic data from domain A using the distributions fitted on data in step 1. Train a conditional effect model on the simulated data.
    
    \item \textbf{Prediction/Estimation}: Simulate data from domain B, whose covariate and treatment distributions may differ from those of domain A but whose COD is identical. Apply the trained model to the sampled covariates and treatments from domain B, and estimate marginal causal quantities from the model's outcome predictions.
    \item \textbf{Evaluate generalizability with statistical testing}: Statistically test whether the estimated marginal causal quantities deviate significantly from the known ground truth in domain B. This provides an evaluation of the model's generalizability under covariate and treatment distribution shifts. The tests assess whether the model generalizes effectively by focusing on lower-dimensional quantities instead of high-dimensional conditional outcome models.
\end{enumerate}

% A high-level overview of the workflow of our method is as follows. (1) We first define two domains, domain A (training) and domain B (testing), of which the covariate and treatment distributions can be different, but the COD is the same. These distributions can be learned empirically from real-world data, rather than just limited to specifying parametric models. (2) Next, we simuluate semi-synthetic data from domain A using pre-specified distributions. Train a conditional effect model on the simulated data. (3) We then simulate data from domain B, whose covariate and treatment distributions may differ from domain A, but with an identical COD. Apply the trained model on the sampled covariates and treatments from domain B and estimate marginal causal quantities outcome predictions from the model. (4) Finally, we statistically test whether the estimated marginal causal quantities deviate significantly from the known ground truth in domain B. This provides an evaluation of the model's generalizability under covariate and treatment distribution shifts. The tests assess whether the model generalizes effectively by focusing on lower-dimensional quantities (marginal causal distributions) instead of high-dimensional conditional outcome models.

\paragraph{Main Contributions} In this work, we propose a formal framework for statistically testing the generalizability of machine learning algorithms under covariate and treatment distribution shifts, specifically in the context of causal inference. Rather than relying on arbitrary metrics, we provide tests that statistically evaluate the transportability of both mean and distributional regression methods. 
% Our approach is built on \textbf{frugal parameterization}\cite{evans2024parameterizing}, enabling simulations from various data-generating processes as well as real-world data.  
% In real applications, generalizability is particularly dependent on key properties from real data. For instance, sample size may affect performances of algorithms like neural networks to play the balance between in-sample performance and generalizability. Complex data structures can also play a crucial rule. This is why our simulation-based method is so important: it offers a comprehensive approach to evaluate model generalizability across diverse scenarios, providing a simple and effective solution to account for these complexities in real-world applications. 
% In real-world applications, generalizability depends on factors such as sample sizes and the complexity of the data structures. Our proposed simulation-based method offers a comprehensive framework for quantitatively evaluating model generalizability across diverse scenarios. 
This provides a simple and effective way to assess how well algorithms cope with the complexities of real-world applications, such as limited sample sizes and complex data structures.

Consequently, we claim that our evaluation method is:
\begin{itemize}
    \item \textbf{\textit{Systematic}}: We offer a structured framework that allows users to easily specify and input flexible data generation processes for simulations.
    \item \textbf{\textit{Comprehensive}}: Our method supports simulations from various data generation processes, covering both continuous and discrete covariates and outcomes.
    \item \textbf{\textit{Robust}}: We incorporate statistical testing to evaluate the generalizability of distributional and mean regression models.
    \item \textbf{\textit{Realistic}}: Simulations can be based on actual data, bridging the gap between synthetic evaluations and real-world applications.
\end{itemize}
% Consequently, we claim that our evaluation method is \textbf{\textit{systematic}} - we offer a structured framework that allows users to easily specify and input flexible data generation processes for simulations, \textbf{\textit{comprehensive}} - our method supports simulations from various data generation processes, covering both continuous and discrete covariates and outcomes,
% % , with distributions like Gamma, Exponential, Gaussian, etc), 
% \textbf{\textit{robust}} - we incorporate statistical testing to evaluate the generalizability of distributional and mean regression models, and \textbf{\textit{realistic}} - simulations can be based on actual data, bridging the gap between synthetic evaluations and real-world applications.

\section{BACKGROUND}

Throughout the paper, we consider a static treatment model with an outcome $Y \in \mathcal{Y}\subseteq \mathbb{R}$ and a general treatment $X$, which can be either continuous or discrete. Let the $D$ measured pretreatment covariates be $\bm{Z} \in \mathcal{Z}\subseteq \mathbb{R}^{D}$. 
Under the standard causal assumptions of SUTVA, positivity, and conditional ignorability outlined in \citet{pearl2009causality}, the marginal \textit{causal} treatment density is given by
\begin{equation}
    p_{Y(x)}(y) = \int p_{\YIZbX}(y \cmid \bm{z}, x) ~ p_{\bm{Z}}(\bm{z})~d\bm{z},
\end{equation} 
which is averaged over the covariate distribution.
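
To make the identification explicit, the steps behind this expression can be sketched as follows. This is the standard adjustment argument rather than a result specific to this paper, and the intermediate subscripts are written out here only for illustration:
\begin{equation*}
    \begin{aligned}
        p_{Y(x)}(y) &= \int p_{Y(x) \mid \bm{Z}}(y \cmid \bm{z})~p_{\bm{Z}}(\bm{z})~d\bm{z} \\
        &= \int p_{Y(x) \mid \bm{Z}, X}(y \cmid \bm{z}, x)~p_{\bm{Z}}(\bm{z})~d\bm{z} \\
        &= \int p_{\YIZbX}(y \cmid \bm{z}, x)~p_{\bm{Z}}(\bm{z})~d\bm{z}.
    \end{aligned}
\end{equation*}
The first line is the law of total probability, the second uses conditional ignorability (independence of $Y(x)$ and $X$ given $\bm{Z}$), and the third uses consistency, which is implied by SUTVA. Positivity ensures that the conditional density is well defined for every $\bm{z}$ in the support of $p_{\bm{Z}}$.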
% We distinguish between the marginal \textit{conditional} treatment density which is the marginalization over the observational dataset:
% \begin{equation}
%     p_{\YIX} = \int p_{\YxIZb} ~ p_{\ZbIX}~d\bm{z},
% \end{equation}
% and the marginal \textit{causal} treatment density:
% \begin{equation}
%     p_{Y(x)} = \int p_{\YxIZb} ~ p_{\bm{Z}}~d\bm{z}.
% \end{equation} which is the marginal from the randomized model.

We also use $\mu(x) = \mathbb{E}\left[Y(x)\right]$ to denote the expected outcome under an intervention setting $X = x$. Correspondingly, we use $\mu(x,\bm{z}) = \mathbb{E}\left[Y(x)\mid \bm{Z}=\bm{z}\right]$ to denote the conditional expectation of that outcome given covariate values. Note that $Y(x)$ is written as $Y \mid \text{do}(X=x)$ in the notation of \citet{pearl2009causality}. When the treatment is binary, we define $\tau = \mathbb{E}[Y(1) -Y(0)]$ as the average treatment effect (ATE), quantifying the overall impact of a treatment change across the entire population. Similarly, let $\tau(\bm{Z}) = \mathbb{E}\left[Y(1) -Y(0)\mid \bm{Z}\right]$ be the conditional average treatment effect (CATE), which gives the effect for specific subgroups or individuals and therefore captures treatment effect heterogeneity. 
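
These quantities are linked by the law of total expectation, which is what allows a conditional model to be summarized by a low-dimensional marginal. As a short worked relation, using only the definitions above:
\begin{equation*}
    \begin{aligned}
        \mu(x) &= \mathbb{E}\left[\mu(x, \bm{Z})\right] = \int \mu(x, \bm{z})~p_{\bm{Z}}(\bm{z})~d\bm{z}, \\
        \tau &= \mu(1) - \mu(0) = \mathbb{E}\left[\tau(\bm{Z})\right].
    \end{aligned}
\end{equation*}
For instance, if hypothetically $\tau(\bm{z}) = 2$ in a subgroup containing $30\%$ of the population and $\tau(\bm{z}) = 0$ elsewhere, then $\tau = 0.3 \times 2 + 0.7 \times 0 = 0.6$. The same CATE function yields a different ATE under a different covariate distribution, which is why marginal quantities must be evaluated in the target domain.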

We aim to evaluate the generalizability of an outcome regression model $\hat{f}(X,\bm{Z})$ that predicts the expected outcome of $Y$; throughout, a hat denotes a quantity estimated by the model. 
\subsection{Generalizability in Causal Inference}

Extensive research has focused on generalizability in causal inference, as mentioned in the Introduction. 
% Recently, combining Randomized Controlled Trials (RCT) data with observational data has shown promise for improving CATE estimations in real-world settings. Calibrating outcome models with observational data helps models trained on RCTs better generalize to diverse populations \citep{curth2021really}.
As highlighted by \citet{ling2022critical}, three common approaches are used to assess treatment effect generalizability: inverse probability of sampling weighting (IPSW) methods that adjust for differences between study and target populations by weighting based on sample inclusion probabilities \citep{buchanan2018generalizing}; outcome model-based methods that estimate the conditional outcome directly \citep{kern2016assessing}; and hybrid approaches that combine both \citep{dahabreh2019generalizing}.

In this work, we focus on algorithms that generalize outcome predictions across different domains, enabling accurate CATE or COD estimation. This is crucial for understanding individual-level treatment effect heterogeneity and ensuring models can adapt to new populations or environments with varying covariate distributions. A summary of common CATE estimation methods is provided by \cite{caron2022estimating}.

Despite advancements in CATE estimation, a systematic framework for evaluating generalizability is still underdeveloped. Commonly used metrics, such as MSE and the Precision in Estimation of Heterogeneous Effect (PEHE), provide limited real-world insights \citep{curth2021really,kiriakidou2022evaluation}. To address this gap, we propose a systematic framework to evaluate how well CATE algorithms perform across domains with different covariate distributions, offering a more practical assessment of whether a given approach will generalize well.

\subsection{Frugal Parameterization}\label{subsec:frugal-params}
A frugal parameterization of an observational joint distribution, $P_{\ZbXY}$, factorizes the distribution into a set of causally relevant components~\citep{evans2024parameterizing}. This decomposition explicitly parameterizes the marginal causal effect, $P_{Y(x)}$, and builds the rest of the model around it. 

Let us start by first parameterizing the \textit{conditional outcome distribution} (COD), $P_{\YxIZb}$. Frugal models parameterize the COD in terms of the marginal causal effect, $P_{Y(x)}$, and a conditional copula distribution, $C_{\YxIZb}$. Here, the copula models the joint dependency between the marginal causal distribution and each of the univariate marginal covariate distributions, $\{P_{Z_i}\}_{i}$ such that:
\begin{equation}
    p_{\YxIZb} = p_{Y(x)} \cdot  c_{\YxIZb},
\end{equation}
where lowercase letters denote the corresponding density functions (see the Supplementary Material for further details on copulas). 
% on the ranks of the marginal probability integral transform of the covariates:
% \begin{equation}
%     C_{\YxIZb} := C\!\left(F_{Y(x)} \mid F_{Z_1},\dots,~F_{Z_{D}} \right).
% \end{equation}
% We present a summary of copulas in the appendix for unfamiliar readers, but in short, copulas provide a framework for encoding dependencies between marginal quantities in such a way that the marginal distributions are preserved.
This leaves the distribution of the \textit{past}, $P_{\bm{Z}X}$, i.e. the covariate distribution and the propensity score. Note that we assume that all covariates are strictly pretreatment, i.e. $\bm{Z}$ cannot include any mediators. The past and the COD are variation independent, in the sense that they parameterize separate, non-overlapping aspects of the joint distribution~\citep{evans2024parameterizing}. This allows the past to be freely specified without affecting either the conditional copula or the marginal causal effect.
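
Combining these pieces gives the following factorization of the observational joint density, written here as a sketch in density form:
\begin{equation*}
    \begin{aligned}
        p_{\ZbXY}&(\bm{z}, x, y) = p_{\bm{Z}X}(\bm{z}, x) \cdot p_{Y(x)}(y) \\
        & \times c_{\YxIZb}\!\left(F_{Y(x)}(y) \cmid F_{Z_1}(z_1), \dots, F_{Z_D}(z_D)\right).
    \end{aligned}
\end{equation*}
As an illustration only, suppose there is a single covariate $Z$ and the copula is Gaussian with a hypothetical correlation parameter $\rho$. Because the rank of $Z$ is uniform, the conditional copula density then coincides with the bivariate Gaussian copula density,
\begin{equation*}
    c(u \cmid v) = \frac{1}{\sqrt{1-\rho^2}} \exp\!\left( -\frac{\rho^2 (a^2 + b^2) - 2 \rho a b}{2 (1-\rho^2)} \right),
\end{equation*}
where $a = \Phi^{-1}(u)$, $b = \Phi^{-1}(v)$, $u = F_{Y(x)}(y)$, $v = F_{Z}(z)$, and $\Phi$ is the standard normal CDF. Setting $\rho = 0$ gives $c \equiv 1$, so the COD reduces to $p_{Y(x)}$ and the covariate carries no information about the potential outcome.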

% The distribution of the \textit{past}, $P_{\bm{Z}X}$, i.e. the covariate distribution and the propensity score, can be freely specified without modifying the marginal causal effect of the target distribution. This leaves the parameterization of the \textit{conditional outcome distribution} (COD) $P_{\YxIZb}$. Frugal models parameterize the COD in terms of the marginal causal effect, $P_{Y(x)}$, and a conditional copula distribution, $C_{\YxIZb}$. Here, $C_{\YxIZb}$ models the joint dependency between the marginal causal distribution and each of the univariate marginal covariate distributions, $\{P_{Z_i}\}_{i}$ such that:
% \begin{equation}
%     P_{\YxIZb} = P_{Y(x)} \cdot  C_{\YxIZb},
% \end{equation}
% where $C_{\YxIZb}$ is a copula distribution function on the ranks of the marginal probability integral transform of the covariates:
% \begin{equation}
%     C_{\YxIZb} := C\!\left(F_{Y(x)} \mid F_{Z_1},\dots,~F_{Z_{D}} \right).
% \end{equation}
% We present a summary of copulas in the appendix for unfamiliar readers, but in short, copulas provide a framework for encoding dependencies between marginal quantities in such a way that the marginal distributions are preserved.

% One can parameterize an observational distribution in such a way that the marginal causal distribution (rather than the marginal \textit{conditional}) in the equivalent randomized model is invariant to any choice of propensity score:
% \begin{equation}
%     P_{\ZbXY} = P_{\bm{Z}X} \cdot P_{Y(x)} \cdot  C_{\YxIZb}.
% \end{equation}
% We emphasize that frugal models target the causal rather than the marginal conditional distribution when the dependency measure is  parameterized by a multivariate copula. Consider the multivariate copula for the distribution of $\bm{Z}$ and $Y$ conditional on $X$: 
% $$C(F_{Y|X}, F_{Z_1|X},\dots, F_{Z_{D}|X}).$$
% For an intervened distribution, all pretreatment covariates $\bm{Z}$ are marginally independent of $X$, simplifying the copula to 
% $$ C(F_{Y|X}, F_{Z_1},\dots, F_{Z_{D}}),$$
% and so the intervened joint density becomes
% $$P_{\ZbYx} = P_{Y(x)} \cdot C(F_{Y(x)}, F_{Z_1},\dots, F_{Z_{D}}) \cdot \prod_{d=1}^{D} P_{Z_d},$$
% where $P_{Y(x)}$ is the marginal causal effect of $X$ on $Y$. The final density $P_{\XIZb}$ is the propensity score, which does not affect the aforementioned marginal densities in the observational model \cite{barndorff2014information}.

\section{METHOD}
\Cref{fig:algo-workflow} provides an overview of our workflow. We begin by defining both a test and a training domain, each with a distribution over the pretreatment covariates and the treatment, allowing for distribution shifts across covariates and treatment allocation. The COD is frugally parameterized with a conditional copula, where the covariates' cumulative distribution functions (CDFs) are derived from the test domain's covariate densities. This ensures that samples from the test dataset follow a \textbf{known, customizable} marginal causal density, $p_{Y(x)}$.

The training data is generated from the same COD but with a non-analytic marginal causal density, as the training covariate densities do not match the covariate CDFs used to parameterize the conditional copula. We then learn a model, $\hat{f}(x, \bm{z})$, on the training data. Finally, a statistical test is performed to assess whether the lower-dimensional marginal quantity (e.g.~ATE, $\tau$, or the expected potential outcome, $\mu(x)$) estimated from the model's predicted outcomes equals the ground truth in the test domain.
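
To make the final step concrete, one simple way to carry out such a test is sketched below. The notation ($n_B$, $\hat{\mu}(x)$, $\hat{\sigma}$, $T$) is introduced here for illustration, and the one-sample $z$-test is an assumption of this sketch, not necessarily the test used in our experiments. Given test covariates $\bm{z}^{B}_{1}, \dots, \bm{z}^{B}_{n_B}$ drawn from domain B and a fixed trained model $\hat{f}$, the plug-in estimate of the expected potential outcome and the associated statistic are
\begin{equation*}
    \begin{aligned}
        \hat{\mu}(x) &= \frac{1}{n_B} \sum_{i=1}^{n_B} \hat{f}(x, \bm{z}^{B}_{i}), \\
        T &= \frac{\sqrt{n_B}\left(\hat{\mu}(x) - \mu^{B}(x)\right)}{\hat{\sigma}},
    \end{aligned}
\end{equation*}
where $\mu^{B}(x)$ is the known ground truth in domain B and $\hat{\sigma}$ is the sample standard deviation of the predictions $\hat{f}(x, \bm{z}^{B}_{i})$. Conditional on $\hat{f}$, and under the null hypothesis that the expectation of $\hat{f}(x, \bm{Z})$ over the domain B covariates equals $\mu^{B}(x)$, $T$ is approximately standard normal for large $n_B$, so a large $|T|$ is evidence that the model fails to generalize to domain B. For binary treatments, the same construction applies to $\hat{\tau} = \hat{\mu}(1) - \hat{\mu}(0)$, using the standard deviation of the differences $\hat{f}(1, \bm{z}^{B}_{i}) - \hat{f}(0, \bm{z}^{B}_{i})$.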

\begin{figure}[h]
\vspace{.3in}
\centerline{\includegraphics[width=1\linewidth]{algorithm.png}}
\vspace{.3in}
\caption{Workflow of the Proposed Method.}
\label{fig:algo-workflow}
\end{figure}

\subsection{Data Simulation}
In this section we describe how to simulate the data. We show that we can construct two datasets with distinct covariate and treatment distributions and the exact same COD, where the marginal causal effect is explicitly known in one domain.

% This involves first parameterizing terms in the test domain,  defining the marginal causal density and the conditional copula density. We define an invariant across both domains in terms of the marginal causal density and the conditional univariate copula density. Everything else can change.

\subsubsection{Multi-domain Simulation with Frugal Models}
We begin by specifying two data generating processes: the training data, $D^{A} \sim P^{A}_{\bm{Z}XY}$, and the test data, $D^{B} \sim P^{B}_{\bm{Z}XY}$. Our goal is to construct a COD that parameterizes the joint density across both domains, while ensuring that the marginal causal density in domain $B$ is parameterized by $p_{Y(x)}$. 

Recall from \Cref{subsec:frugal-params} that a general observational density can be factorized into the \textit{past}, $p_{\bm{Z}X}$, and the COD, $p_{\YxIZb}$, with the COD given by
\begin{equation}\label{eq:cod}
    \begin{aligned}
        p&_{\YxIZb}(y \mid \bm{z}) = p_{Y(x)}(y) \times \\ 
        & \qquad c_{\YxIZb}\!\left(F_{Y(x)}(y) \mid F_{Z_{1}}(z_1),\dots, F_{Z_{D}}(z_{D}) \right),
    \end{aligned}
\end{equation}
where $F_{Y(x)}$ is the CDF associated with the marginal causal density $p_{Y(x)}$. 

Note that the copula density in (\ref{eq:cod}) is not only determined by the copula's family and its parameterization, but also by the choice of marginal CDFs for the covariates, $\bm{Z}$. If the conditional copula density is marginalized over the densities corresponding to the covariate CDFs, then the ranks of the marginal causal density will be uniformly distributed:
\begin{equation}
    p\left(F_{Y(x)}\right) = \int d\bm{z}~c_{\YxIZb}(y(x) \cmid \bm{z}) \cdot \prod_{d=1}^{D}p_{Z_{d}}(z_{d}) = 1.
\end{equation}
However, this uniformity is guaranteed only if the marginal covariate densities $\{ p_{Z_d} \}_{d=1}^{D}$ correspond to the CDFs used to parameterize the copula. If we instead consider a set of alternative marginal densities, $\{p'_{Z_d}\}_{d=1}^{D}$, that are not derived from the CDFs within the copula, i.e., $F_{Z_{d}}(t) \neq F_{Z'_{d}}(t)$ for some $t$,
% \begin{equation*}
%     F_{Z_{d}}(Z_{d} = t) \neq F_{Z'_{d}}(Z'_{d} = t)
%     % F_{Z_{d}}(z_{d}) \neq \int_{-\infty}^{z_{d}} p_{Z'_{d}}(t)~dt
% \end{equation*}
then the rank uniformity is not assured. Thus, the COD is a valid density under any distribution of the past, but it guarantees sampling from the specified marginal causal density only if the covariate densities correspond to the CDFs that parameterize the copula. In the Supplementary Material, we present conditions under which alternative covariate distributions still yield samples from the specified marginal causal density, assuming that the conditional copula is Gaussian. Given how rarely these conditions are satisfied,
% especially in high-dimensional settings where the marginal distributions of the covariates can become quite complex, 
we do not expect this situation to arise often in semi-synthetic benchmark generation. These conditions are likely to be even harder to satisfy if a more complex multivariate copula (such as a non-Gaussian vine copula) is chosen.
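
A simple special case, given here only as an illustration and not as a restatement of the conditions in the Supplementary Material, shows why the match between covariate densities and copula CDFs matters. Suppose there is a single covariate, the copula is Gaussian with correlation $\rho$, and the copula is parameterized by $F_{Z} = \Phi$, the standard normal CDF. Under this copula, the normal score of the outcome rank, $A = \Phi^{-1}\!\left(F_{Y(x)}(Y(x))\right)$, satisfies
\begin{equation*}
    A \mid Z = z \;\sim\; \mathcal{N}\!\left(\rho z,\; 1 - \rho^2\right).
\end{equation*}
If $Z \sim \mathcal{N}(0,1)$, matching the CDF in the copula, then $A \sim \mathcal{N}(0, 1)$ and the rank is uniform. If instead the covariate is drawn from a hypothetical shifted density, $Z' \sim \mathcal{N}(m, s^2)$, then
\begin{equation*}
    A \;\sim\; \mathcal{N}\!\left(\rho m,\; 1 + \rho^2 (s^2 - 1)\right),
\end{equation*}
which is standard normal only when $\rho = 0$, or when $m = 0$ and $s = 1$. In every other case the outcome rank is not uniform, and the sampled marginal causal density differs from $p_{Y(x)}$.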

For evaluating generalization, we set the CDFs within the copula density to be derived from the covariate densities in the test domain $P_{\bm{Z}XY}^{B}$. This allows us to construct the COD density across all covariate spaces,
\begin{equation}
    \begin{aligned}
        p&_{\YxIZb}\left( y \cmid \bm{z}\right) = p^{B}_{Y(x)}\left(y\right) \times \\ 
        & \qquad c_{\YxIZb}\!\left(F^{B}_{Y(x)}(y) \,\middle|\, F_{Z_1^{B}}(z_1), \dots, F_{Z_{D}^{B}}(z_D) \right)
    \end{aligned}
\end{equation}
which yields samples from the known marginal causal density $p^{B}_{Y(x)}$ when the covariates are drawn from the test domain, since the covariate CDFs in the copula are derived from the test domain covariate densities. 
% We are free to vary the density of $p_{\bm{Z}X}^{A}$ in the training data to be any alternative, while retaining the form of the COD.
% We reemphasize: if covariate CDFs which parameterize the COD are equal to those which parameterize the past, we will sample from a joint distribution with the exact marginal causal density specified by $p_{Y(x)}$. 

This offers a great deal of flexibility in testing method generalizability. One can draw training and test datasets with different covariate densities and propensity scores, while guaranteeing that the CODs remain consistent, and that the test data is drawn from a distribution with a marginal causal density parameterized by $p_{Y(x)}$. 
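
The same construction also shows why the marginal causal density in the training domain is generally not available in closed form. As a sketch, write $p^{A}_{\bm{Z}}$ and $p^{B}_{\bm{Z}}$ for the covariate densities in the two domains and $\bm{u}^{B}(\bm{z}) = \left(F_{Z_1^{B}}(z_1), \dots, F_{Z_D^{B}}(z_D)\right)$ for the test-domain ranks (shorthand introduced here for illustration). Marginalizing the shared COD over the training covariates gives
\begin{equation*}
    \begin{aligned}
        p^{A}_{Y(x)}(y) &= \int p_{\YxIZb}(y \cmid \bm{z})~p^{A}_{\bm{Z}}(\bm{z})~d\bm{z} \\
        &= p^{B}_{Y(x)}(y) \int c_{\YxIZb}\!\left(F^{B}_{Y(x)}(y) \cmid \bm{u}^{B}(\bm{z})\right) \\
        & \qquad \times p^{A}_{\bm{Z}}(\bm{z})~d\bm{z}.
    \end{aligned}
\end{equation*}
The remaining integral equals one when $p^{A}_{\bm{Z}} = p^{B}_{\bm{Z}}$ and generally depends on $y$ otherwise, so $p^{A}_{Y(x)} \neq p^{B}_{Y(x)}$ in general. In the test domain, by contrast, ground-truth quantities follow directly from the specified density, e.g. $\mu^{B}(x) = \int y~p^{B}_{Y(x)}(y)~dy$.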


% When it comes to simulation based on actual data, we begin the process by first learning the marginal distributions of the pretreatment covariates in the test set, $\{\hat{F}^{B}_{Z}\}_{i}$. In this paper, we estimate these distribution functions using the empirical CDF, though more flexible techniques like Kernel Density Estimation could also be employed. Next, we learn a parametric form of marginal causal effect $\hat{p}^{B}_{Y(x)}$ alongside a multivariate Gaussian copula $\hat{c}^{B}_{\ZbYx}\left(\hat{F}^{B}_{Y(x)}, \hat{F}^{B}_{Z_1}, \dots, \hat{F}^{B}_{Z_D} \right)$ which captures the marginal dependency between all the covariates and the marginal causal effect. While we assume a Gaussian copula for its simplicity, more flexible models such as vine copulas could also be used for greater flexibility. Finally, we estimate the propensity score model $\hat{p}^{B}_{\XIZb}$. With all these components,
% \begin{equation*}
%     \{\hat{F}^{B}_{Z_{1}}, \dots, \hat{F}^{B}_{Z_{D}}, \hat{p}^{B}_{Y(x)}, \hat{p}^{B}_{\XIZb},  \hat{c}^{B}_{\ZbYx} \}
% \end{equation*}
% we can draw samples from the test distribution.

% We follow similar steps to sample training datasets, estimating $\{\hat{F}^{A}_{Z}\}_{i}$ and the Gaussian copula between the covariates, $\hat{c}^{A}_{\bm{Z}}$. We use the conditional univariate copula $\hat{c}^{A}_{\YxIZb}$ derived from the multivariate testing copula, $\hat{c}^{B}_{\ZbYx}$. Finally, we estimate the training propensity score model, $\hat{p}_{\XIZb}$. With the following components,
% \begin{equation*}
%     \{\hat{F}^{A}_{Z_{1}}, \dots, \hat{F}^{A}_{Z_{D}}, \hat{p}^{B}_{Y(x)}, \hat{c}^{B}_{\YxIZb}, \hat{p}^{A}_{\XIZb},  \hat{c}^{A}_{\bm{Z}}, \}
% \end{equation*}
% we can draw training samples with covariates and treatment assignments similarly distributed to the real dataset, but with the same COD as testing data. In this way, we allow for covariate and treatment distribution shift, while ensuring the marginal causal density in the test data is exactly equal to $\hat{p}^{B}_{Y(x)}$.

% \begin{algorithm}[t]
% \caption{Semi-synthetic Data Generation Process.}
% \begin{algorithmic}%[1]
% \vspace*{2pt}
% \STATE{\textbf{Step 1}: Learn marginal covariate distributions $\hat{F}^{B}_{Z}$.}
% \STATE{\textbf{Step 2}: Estimate $\hat{p}^{B}_{Y(x)}$.}
% \STATE{\textbf{Step 3}: Learn the multivariate Gaussian copula $\hat{c}^{B}_{\ZbYx}$.}
% \STATE{\textbf{Step 4}: Calculate the conditional univariate copula $\hat{c}^{B}_{\YxIZb}$.}
% \STATE{\textbf{Step 5}: Estimate $\hat{p}^{B}_{\XIZb}$.}
% \STATE{\textbf{Step 6}: Simulate data from the test distribution.}
% \STATE{\textbf{Step 7}: Learn marginal distributions $\hat{F}^{A}_{Z}$ and the Gaussian copula $\hat{c}^{A}_{\bm{Z}}$.}
% \STATE{\textbf{Step 8}: Sample ranks for covariates in domain $A$, $\Phi^{-1}\left(\hat{\bm{u}}_{\bm{Z}}^{A}\right) \sim \hat{c}^{A}_{\bm{Z}}$.}
% \STATE{\textbf{Step 9}: Calculate training covariate samples, $\hat{z}^{A}_{i} = \hat{F}_{Z_{i}}^{A}^{-1}(u_{Z_i})$}
% \STATE{\textbf{Step 10}: Estimate the propensity score model $\hat{p}^{A}_{\XIZb}$.}
% \STATE{\textbf{Step 11}: Draw treatment samples, $\hat{x}^{A} \sim \hat{p}^{A}_{\XIZb}(\cdot \mid )$.}
% \STATE{\textbf{Step 12}: Calculate the marginal causal rank in domain $A$, $u_{Y(x)}^{A} \sim \hat{c}^{B}_{\YxIZb}\left( \cdot \mid \hat{F}_{Z_1}^{B}(u_{Z_1}^{A}), \dots, \hat{F}_{Z_D}^{B}(u_{Z_D}^{A}) \right)$}
% \STATE{\textbf{Step 13}: Simulate data from the training distribution.}
% \end{algorithmic}
% \label{alg:semisynthetic_data}
% \end{algorithm}
\subsubsection{Generating Semi-Synthetic Benchmarks}
\begin{algorithm}[h!]
\caption{Semi-synthetic Data Generation.}
\begin{algorithmic}
% \scriptsize
\vspace*{2pt}
\STATE{\textbf{Input}:~Original test data; original covariates and treatment from training data.}
\vspace*{2pt}
\STATE{\textbf{Parameter estimation on test domain}}

Estimate test empirical CDFs, $\{\hat{F}^{B}_{Z_d}\}_{d=1}^D$; marginal causal density, $\hat{p}^{B}_{Y(x)}$; joint copula, $\hat{c}^{B}_{\ZbYx}$; propensity score model, $\hat{p}^{B}_{\XIZb}$.
\vspace*{2pt}
\STATE{\textbf{Transformation on test domain}}

Sample $(\bm{u}_{\bm{Z}}^{B}, u_{Y(x)}^{B}) \sim \hat{c}^{B}_{\ZbYx}$;\\
Calculate $\{z_{d}^{B} = [\hat{F}_{Z_d}^{B}]^{-1}(u_{Z_d}^{B})\}_{d=1}^D$;\\
Sample $x^{B} \sim \hat{p}^{B}_{\XIZb}(\cdot \mid \bm{z}^{B})$;\\
Calculate $y^{B} = [\hat{F}_{Y(x)}^{B}]^{-1}(u_{Y(x)}^{B})$.

\vspace*{2pt}
\STATE{\textbf{Parameter estimation on training domain}}

Estimate training empirical CDFs, $\{\hat{F}^{A}_{Z_d}\}_{d=1}^D$; covariate copula, $\hat{c}^{A}_{\bm{Z}}$;  propensity score model,  $\hat{p}^{A}_{\XIZb}$.

\vspace*{2pt}
\STATE{\textbf{Transformation on training domain}} 

Sample $\bm{u}_{\bm{Z}}^{A} \sim \hat{c}^{A}_{\bm{Z}}$;\\
Calculate $\{z_{d}^{A} = [\hat{F}_{Z_d}^{A}]^{-1}(u_{Z_d}^{A})\}_{d=1}^D$;\\
Sample $x^{A} \sim \hat{p}^{A}_{\XIZb}(\cdot \mid \bm{z}^{A})$; \\
Sample $u_{Y(x)}^{A} \sim \hat{c}^{B}_{\YxIZb}( \cdot \mid \hat{F}_{Z_1}^{B}(z_{1}^{A}), \dots, \hat{F}_{Z_D}^{B}(z_{D}^{A}))$;\\
Calculate  $y^{A} = [\hat{F}_{Y(x)}^{B}]^{-1}\left(u_{Y(x)}^{A} \right)$.

\vspace*{2pt}
\STATE{\textbf{Output}: Training sample $D^{A} = (\bm{z}^{A}, x^{A}, y^{A})$};\\ \hspace*{40pt} Test sample $D^{B} = (\bm{z}^{B}, x^{B}, y^{B})$. 
\end{algorithmic}
\label{alg:semisynthetic_data}
\end{algorithm}
When real data are available, we follow the workflow outlined in \Cref{alg:semisynthetic_data}. First, we estimate the empirical CDFs of the pretreatment covariates for the test data, denoted $\hat{F}^{B}_{Z_d}$ for $d = 1,\dots, D$. We then estimate the marginal causal density $\hat{p}^{B}_{Y(x)}$ and the joint copula $\hat{c}^{B}_{\ZbYx}$, which captures the covariate-outcome dependency conditional on treatment. With the test copula estimated, we draw rank samples $(\bm{u}_{\bm{Z}}^{B}, u_{Y(x)}^{B}) \sim \hat{c}^{B}_{\ZbYx}$ and apply the inverse CDF transform to generate the covariate samples $z_{d}^{B} = [\hat{F}_{Z_d}^{B}]^{-1}(u_{Z_d}^{B})$. Next, we estimate the propensity score model for the test data, $\hat{p}^{B}_{\XIZb}$, and sample the treatment variable $x^{B} \sim \hat{p}^{B}_{\XIZb}(\cdot \mid \bm{z}^{B})$. The outcome for the test data is then calculated as $y^{B} = [\hat{F}_{Y(x)}^{B}]^{-1}(u_{Y(x)}^{B})$, where $u_{Y(x)}^{B}$ is the outcome rank sampled from the copula. For the training data, we follow a similar approach, except that the covariate ranks are drawn from the training covariate copula $\hat{c}^{A}_{\bm{Z}}$, while the outcome rank is drawn from the test conditional copula $\hat{c}^{B}_{\YxIZb}$ and transformed with $[\hat{F}_{Y(x)}^{B}]^{-1}$, so that both domains share the same COD. Details can be found in \Cref{alg:semisynthetic_data}.
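
As an illustration of the conditional rank-sampling step for the training domain in \Cref{alg:semisynthetic_data}, suppose (an assumption made here for concreteness) that $\hat{c}^{B}_{\ZbYx}$ is a Gaussian copula with correlation matrix $R$, partitioned into the covariate block $R_{\bm{Z}\bm{Z}}$, the row vector of cross-correlations $R_{Y\bm{Z}}$, and a unit entry for $Y(x)$. The symbols $R$, $\bm{q}$, $m$, $s$ and $\varepsilon$ are introduced only for this sketch. Writing $q_d = \Phi^{-1}\bigl(\hat{F}^{B}_{Z_d}(z^{A}_d)\bigr)$ for the normal scores of the training covariates under the test marginals, and provided each $\hat{F}^{B}_{Z_d}(z^{A}_d)$ lies strictly between 0 and 1, the standard Gaussian conditioning formula gives
\begin{align*}
m &= R_{Y\bm{Z}} R_{\bm{Z}\bm{Z}}^{-1} \bm{q}, \\
s^2 &= 1 - R_{Y\bm{Z}} R_{\bm{Z}\bm{Z}}^{-1} R_{Y\bm{Z}}^{\top}, \\
u^{A}_{Y(x)} &= \Phi(m + s\,\varepsilon), \quad \varepsilon \sim \mathcal{N}(0,1),
\end{align*}
where $\Phi$ is the standard normal CDF. Other copula families would require their own conditional sampling rule.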

% First, we estimate the empirical CDFs $\hat{F}^{A}_{Z_d},~\forall ~ d = \{1, \dots, D\}$ and the covariate copula $\hat{c}^{A}_{\bm{Z}}$. We draw samples from this copula, $\hat{\bm{u}}_{\bm{Z}}^{A}$, and perform an inverse transform to generate the actual covariate samples, $\hat{z}_{d}^{A} = \hat{F}_{Z_d}^{A^{-1}}(\hat{u}_{Z_d}^{A})$.

% We then estimate the propensity score model for the training data, $\hat{p}^{A}_{\XIZb}$, and use it to sample the treatment variable, $\hat{x}^{A} \sim \hat{p}^{A}_{\XIZb}(\cdot \mid \hat{\bm{z}}^{A})$. The marginal causal rank for the training data is calculated as $\hat{u}_{Y(x)}^{A}$, using the copula $\hat{c}^{B}_{\YxIZb}$ from the test data:
% \begin{align*}
%     \hat{u}_{Y(x)}^{A} &\sim \hat{c}^{B}_{\YxIZb}\big( \cdot \mid \\
%     &\hat{F}_{Z_1}^{B}(\hat{F}_{Z_1}^{A^{-1}}(u_{Z_1}^{A})), \dots, \hat{F}_{Z_D}^{B}(\hat{F}_{Z_D}^{A^{-1}}(u_{Z_D}^{A})) \big).
% \end{align*}
% Finally, we perform an inverse transform to obtain the outcome samples for the training data, $\hat{y}^{A} = \hat{F}_{Y(x)}^{B^{-1}}(\hat{u}_{Y(x)}^{A})$, where we make sure to use the marginal causal effect parameters and the conditional copula $\hat{c}^{B}_{\YxIZb}$ are derived from the test data to ensure the test and training CODs are identical.

% 3) Sample from copula

% 4) Draw samples from the copula, and inverse CDF transform to draw samples from Z.

% 5) Estimate propensity score

% 6) Sample treatment

% 7) Sample outcome using $y^{(i)} = \hat{F}_{Y(x)}^{B}(u^{(i)}_{Y(x)} \mid x^{(i)})$

% 8) Do the same for the training data, except make sure that when sampling the quantiles from the sample outcome, $u_{Y(x)}^{A} \sim \hat{c}^{B}_{\YxIZb}\left( \cdot \mid u_{Z_1}^{A}, \dots, u_{Z_D}^{A} \right)$

% 9) Sample the training outcome $y^{A} = \hat{F}^{B}_{Y(x)}\left( u_{Y(x)}^{A} \mid x^{(i)}_{A} \right)$


\subsection{Statistical Testing}

Since the marginal causal density $p^{B}_{Y(x)}$ is known from the frugal parameterization, we can test $\mathcal{H}_0: \mathbb{E}\left[\hat{\mu}^{B}(x)\right] = \mu^{B}(x)$ instead of $\mathcal{H}_0: \mathbb{E}\left[\hat{\mu}^{B}(x,\bm{z})\right] =  \mu^{B}(x,\bm{z})$ for mean regression models, and $\mathcal{H}_0: \hat{P}^{B}_{Y(x)} = P^{B}_{Y(x)}$ instead of $\mathcal{H}_0: \hat{P}^{B}_{\YxIZb} = P^{B}_{\YxIZb}$ for distributional regression models.


Our testing algorithms require the following parameters: $N_B$, the number of bootstrap samples; $N^{tr}$ and $N^{te}$, the numbers of samples simulated from the training and test domains, respectively, in each bootstrap; and, for distributional testing, $N_Y$, the number of outcome samples simulated from the distributional regression output for each $\hat{f}(x, \bm{z})$. We provide testing methods for two types of regression models: mean regression in \Cref{alg:mean_test_algo} and distributional regression in \Cref{alg:dist_test_algo}. Note that, in \Cref{alg:mean_test_algo}, we can replace $\mu^{te}(x)$ with $\tau^{te}$ as the reference target when $X$ is binary, which is what we do in our experiments. The test used in \Cref{alg:dist_test_algo} can also be replaced by other statistical tests, e.g.~the Maximum Mean Discrepancy test \citep{gretton2012kernel} or the Cramér-von Mises test \citep{anderson1962distribution}.

\begin{algorithm}[t]
\caption{Generalizability Evaluation on Mean Regression Models.}
\begin{algorithmic}%[1] % this prints line nubmers
% \scriptsize
\vspace*{2pt}
\STATE{\textbf{Input}:~~$\Theta^{tr}$: parameters for training domain,\\
% \hspace*{34.3pt} $B$: number of bootstrap samples,\\
% % \hspace*{34.3pt} $\delta$: confidence level,\\
% \hspace*{34.3pt} $N^{te}$: number of $(X,Z)$ samples simulated for each bootstrap,\\
\hspace*{34.3pt} $\Theta^{te}$: parameters for test domain,\\
\hspace*{34.3pt} $\mu^{te}(x^0)$: reference.}
\vspace*{3pt}
\FOR{$b=1, \ldots, {N_B}$}
    \STATE{Draw $D_b^{tr}:= \{(\bm{z}'_{ib}, x'_{ib}, y'_{ib})\}_{i=1}^{N^{tr}} \sim P_{\Theta^{tr}}$};
    \STATE{Fit the mean regression model, $\hat{f}$, on $D_b^{tr}$};
    \STATE{Draw $D_b^{te}:= \{(\bm{z}_{ib}, x_{ib})\}_{i=1}^{N^{te}} \sim P_{\Theta^{te}}$};
    \STATE{Apply $\hat{f}$ on $D_b^{te}$ to get predictions $\{\hat{f}(x_{ib},\bm{z}_{ib})\}_{i = 1}^{N^{te}}$}; 
    \STATE{Calculate 
    $$\hat{\mu}_b^{te}(x^0) = \frac{\sum_{i=1}^{N^{te}} \mathbb{1}(x_{ib}=x^0)\hat{f}(x_{ib},\bm{z}_{ib})}{\sum_{i=1}^{N^{te}}\mathbb{1}(x_{ib}=x^0)}.$$}
    % $$\hat{\mu}_b^{te}(x^0) = \frac{1}{\sum_{i=1}^{N^{te}}\mathbb{1}(x_{ib}=x^0)}\sum_{i=1}^{N^{te}} \mathbb{1}(x_{ib}=x^0)\hat{f}(x_{ib},\bm{z}_{ib})$$}.
\ENDFOR
\STATE{\textbf{end for}}
\vspace*{3pt}
% \STATE{Denote $l^{te}$, $u^{te}$ as the $(1-\delta)/2$ and $1-(1-\delta)/2$ quantiles of $\{\hat{\mu}_b^{te}(c)\}_{b=1}^B$}.
% \IF{$\mu^{te}\in \left[l^{te}, u^{te}\right]$}
\STATE{Obtain the p-value $p$ from a t-test comparing the reference $\mu^{te}(x^0)$ with the mean of $\{\hat{\mu}_b^{te}(x^0)\}_{b=1}^{N_B}$}.
% \IF{$\mu^{te}\in \left[l^{te}, u^{te}\right]$}
% \STATE{\textbf{Return} True.}
% \ELSE
% \STATE{\textbf{Return} False.}
% \ENDIF
\STATE{\textbf{Return} $p$.}
\vspace*{3pt}
\end{algorithmic}
\label{alg:mean_test_algo}
\end{algorithm}
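
One way to make the final testing step of \Cref{alg:mean_test_algo} concrete is to read it as a one-sample, two-sided t-test. This reading, and the symbols $\bar{\mu}$, $s$, $T$ and $F_{t,N_B-1}$, are assumptions introduced here for illustration. With
\begin{align*}
\bar{\mu} &= \frac{1}{N_B}\sum_{b=1}^{N_B} \hat{\mu}_b^{te}(x^0), \\
s^2 &= \frac{1}{N_B-1}\sum_{b=1}^{N_B} \bigl(\hat{\mu}_b^{te}(x^0) - \bar{\mu}\bigr)^2, \\
T &= \frac{\bar{\mu} - \mu^{te}(x^0)}{s/\sqrt{N_B}},
\end{align*}
the p-value would be $p = 2\bigl(1 - F_{t,N_B-1}(|T|)\bigr)$, where $F_{t,N_B-1}$ denotes the CDF of Student's $t$ distribution with $N_B-1$ degrees of freedom.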

\begin{algorithm}[t]
\caption{Generalizability Evaluation on Distributional Regression Models.}
\begin{algorithmic}%[1] % this prints line numbers
% \scriptsize
\vspace*{2pt}
\STATE{\textbf{Input}:~~$\Theta^{tr}$: parameters for training domain,\\
% \hspace*{34.3pt} $B$: number of bootstrap samples,\\
% % \hspace*{34.3pt} $\alpha$: significance level,\\
% \hspace*{34.3pt} $N^{te}$: number of $(X,Z)$ samples simulated in each bootstrap,\\
% \hspace*{34.3pt} $N_Y$: number of $Y$ samples simulated from distributional regression output $\hat{f}(X,Z)$,\\
\hspace*{34.3pt} $\Theta^{te}$: parameters for test domain,\\
\hspace*{34.3pt} $P^{te}_{Y(x^0)}$: reference.}
% \hspace*{34.3pt} $P_{Y(x)}^{te}(\cdot \mid x)$: reference.}
\vspace*{3pt}
\FOR{$b=1, \ldots, N_B$}
    \STATE{Sample $D_b^{tr}:= \{(\bm{z}'_{ib}, x'_{ib},y'_{ib})\}_{i=1}^{N^{tr}} \sim P_{\Theta^{tr}}$}; 
    \STATE{Fit the distributional regression model, $\hat{f}$, on $D_b^{tr}$};
    \STATE{Sample $D_b^{te}:= \{(\bm{z}_{ib}, x_{ib})\}_{i=1}^{N^{te}} \sim P_{\Theta^{te}}$}; 
    \STATE{Apply $\hat{f}$ on $D_b^{te}$ to get distributional predictions $\hat{P}_{Y(x_{ib})|\bm{z}_{ib}}$};
    \STATE{For each $i$, sample $\{y^j_{ib}\}_{j=1}^{N_Y}$ from $\hat{P}_{Y(x_{ib})|\bm{z}_{ib}}$}.
\ENDFOR
\STATE{\textbf{end for}}
\vspace*{3pt}
\STATE{Estimate $\hat{P}^{te}_{Y(x^0)} =$}
% \STATE{\hspace{1em} $\hat{P}(Y \mid do(X) = c) =$}
\STATE{\hspace{2em} $\bigcup_{b=1}^{N_B} \bigcup_{i=1}^{N^{te}} \bigcup_{j=1}^{N_Y} \left\{ y_{ib}^j \mid x_{ib} = x^0 \right\}$}.
\STATE{Conduct distribution tests, e.g.~the Kolmogorov-Smirnov test, for $\mathcal{H}_0:  \hat{P}^{te}_{Y(x^0)}=P^{te}_{Y(x^0)}$ and get the p-value $p$.}
\STATE{\textbf{Return} $p$.}
\vspace*{3pt}
\end{algorithmic}
\label{alg:dist_test_algo}
\end{algorithm}
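
As a sketch of the Kolmogorov-Smirnov option in \Cref{alg:dist_test_algo}, assume a one-sample test against the reference CDF. This reading and the symbols $M$, $\hat{F}_M$, $F^{te}_{Y(x^0)}$ and $K_M$ are introduced here for illustration only. Let $M$ be the number of pooled samples $y^j_{ib}$ with $x_{ib} = x^0$, let $\hat{F}_M$ be their empirical CDF, and let $F^{te}_{Y(x^0)}$ be the CDF of the reference $P^{te}_{Y(x^0)}$. The test statistic is
\begin{equation*}
K_M = \sup_{y} \left| \hat{F}_M(y) - F^{te}_{Y(x^0)}(y) \right|,
\end{equation*}
and the p-value $p$ is the probability, under $\mathcal{H}_0$, of observing a value at least as large as $K_M$. For independent samples and large $M$, this is commonly approximated using the Kolmogorov distribution of $\sqrt{M} K_M$.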



A summary of this workflow is presented in \Cref{fig:algo-workflow}.


\section{EXPERIMENTS}


In this section, we use our workflow to evaluate the generalizability of a range of modern causal models.


As discussed in several review papers, including \cite{curth2021really}, \cite{ling2022critical} and \cite{kiriakidou2022evaluation}, methods such as Meta-Learners (e.g.~T- and S-learners) \citep{kunzel2019metalearners}, CausalForest \citep{wager2018estimation}, TARNet \citep{shalit2017estimating}, and BART \citep{chipman2010bart} are widely used for CATE estimation, each offering advantages in different scenarios. Our evaluation focuses on their performance under covariate distribution shifts, specifically the accuracy of their CATE estimates. Further details can be found in the Supplementary Material.


Another algorithm of interest is engression, introduced in \cite{shen2023engression}. It approximates the conditional distribution using a pre-additive noise model. As a distributional regression method, it is capable of extrapolating to unseen or underrepresented data points through its learned non-linear transformations. The key factors affecting engression's generalizability are the distance between the two domains and whether the true underlying function is strictly monotonic in the extrapolation region. In our experiments, we evaluate engression in both the S-learner and T-learner settings.

\subsection{Synthetic Data}

We first conduct experiments on synthetic data to demonstrate and validate our method. While our approach can handle various data types and is particularly effective with high-dimensional covariates and continuous treatment interventions, for clarity we focus in this simple example on two continuous confounders, $Z_1$ and $Z_2$, sampled from identical Gamma distributions, with a binary intervention $X$. We first consider a randomized controlled trial (RCT) setting, with $X \sim \operatorname{Bernoulli}(0.5)$. Note that these parameters can differ between the two domains; we make them identical here for simplicity. We parameterize the Gaussian copula, $c_{\ZbYx}$, with Spearman correlation coefficients $\rho_{Z_1 Z_2} = 0$, $\rho_{Z_1Y(x)} = 0.1$ and $\rho_{Z_2Y(x)} = 0.9$. The distribution of $Y(x)$ is defined as $\mathcal{N}(1+2x,1)$ in the test domain. For the simulation, we generate $N^{tr} = 200$ training samples and $N^{te} = 50$ test samples per bootstrap, with $N_B=200$ bootstraps in total, and repeat this process for 50 iterations. The marginal distributions of $Z_1$ and $Z_2$ in the training domain follow identical Gamma distributions with shape $k=1$ and rate $\theta=1$.
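
To make this specification concrete, two quantities it implies can be worked out directly. First, if the Spearman coefficients above are converted to the correlation parameters of the Gaussian copula (an assumption about the implementation; the symbol $r$ is introduced here for illustration), the standard relation for a bivariate Gaussian copula is
\begin{equation*}
r = 2 \sin\!\left(\frac{\pi \rho}{6}\right),
\end{equation*}
so that $\rho_{Z_1Y(x)} = 0.1$ and $\rho_{Z_2Y(x)} = 0.9$ correspond to $r \approx 0.105$ and $r \approx 0.908$, respectively, while $\rho_{Z_1 Z_2} = 0$ gives $r = 0$. Second, since $Y(x) \sim \mathcal{N}(1+2x,1)$, the reference means are $\mu^{te}(0) = 1$ and $\mu^{te}(1) = 3$, and, reading $\tau^{te}$ as the average treatment effect, $\tau^{te} = \mu^{te}(1) - \mu^{te}(0) = 2$.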

We examine two settings. In Setting 1, the test domain has a slight covariate shift, with $Z_1$ and $Z_2$ following a Gamma distribution with $k=2$ and $\theta=1$. In Setting 2, the shift is larger ($k=4$, $\theta=1$). Despite these shifts, the COD remains the same owing to the frugal parameterization, as shown in \Cref{fig:synthetic}.


\begin{figure}[h]
\vspace{.3in}
\centerline{\includegraphics[width=1\linewidth]{synthetic_group.png}}
\vspace{.3in}
\caption{Synthetic Data Generated from Setting 1 (Top) and Setting 2 (Bottom). }
\label{fig:synthetic}
\end{figure}

The p-values in \Cref{fig:synthetic_mean_p} illustrate the differences across models. As expected, with the larger domain shift in Setting 2, models have greater difficulty generalizing, as reflected by generally smaller p-values than in Setting 1. T-BART and T-engression show good generalizability in this specific setting. TARNet struggles, likely due to the complexity of its representation learning network design and hyperparameter tuning.

\begin{figure}[h]
\vspace{.3in}
\centerline{\includegraphics[width=0.8\linewidth]{synthetic_mean_p.png}}
\vspace{.3in}
\caption{$p$-values of Mean Regression Testing, Synthetic Data of 50 Iterations.}
\label{fig:synthetic_mean_p}
\end{figure}

Our method also allows us to test the generalizability of distributional regression models. \Cref{fig:synthetic_distribution_p} shows the p-values of distributional regression testing for S-engression under the two settings, with $N_Y=50$. As expected, since the covariate distribution shift in Setting 1 is smaller, S-engression generalizes better there than in Setting 2.

\begin{figure}[h]
\vspace{.3in}
\centerline{\includegraphics[width=0.75\linewidth]{synthetic_distribution_p_new.png}}
\vspace{.3in}
\caption{$p$-values of Distributional Regression Testing (Kolmogorov–Smirnov Test) of S-engression, Synthetic Data of 50 Iterations.}
\label{fig:synthetic_distribution_p}
\end{figure}

Supported by flexible simulations based on actual data, our method is useful for stress testing and model diagnostics. \Cref{fig:varying_n} illustrates an example in which we examine how varying the training set size affects the generalizability of T-BART and T-engression. The generalizability of both models worsens once $N^{tr}$ exceeds 100. This may stem from issues such as overfitting, but resolving them is not our focus. Rather, our method serves as a tool to detect and highlight potential issues when making predictions on real data, which is made feasible by simulating from actual data using the frugal parameterization.

\begin{figure}[t]
\vspace{.3in}
\centerline{\includegraphics[width=0.8\linewidth]{varying_n_train.png}}
\vspace{.3in}
\caption{$p$-values of Mean Regression Testing of 50 Iterations, Varying $N^{tr}$, Setting 2, Synthetic Data.}
\label{fig:varying_n}
\end{figure}

Note that extrapolation performance for models like engression is typically evaluated visually, one dimension at a time. Our method, in contrast, provides a statistical evaluation of extrapolation performance with high-dimensional covariates.


\subsection{Real Data}

We evaluate algorithm generalizability using the Infant Health and Development Program (IHDP) dataset, a randomized experiment conducted between 1985 and 1988 to study the effect of home visits on infants' cognitive test scores~\citep{hill2011bayesian}. This dataset has become widely used in domain adaptation research \citep{johansson2018learning,curth2021really,shi2021invariant}.

The IHDP dataset contains $T=1000$ trials, each consisting of the same 747 subjects and 25 covariates, with the first six being continuous and the rest binary. The potential outcomes $Y(1)$ and $Y(0)$ are provided in the data. In each trial $t$, $Y(0) \sim \mathcal{N}(\bm{Z}\beta_t,1)$ and $Y(1) \sim \mathcal{N}(\bm{Z}\beta_t+4,1)$, where the entries of $\beta_t$ are randomly sampled from $(0, 1, 2, 3, 4)$ with probabilities $(0.5, 0.2, 0.15, 0.1, 0.05)$. Thus, the potential outcomes vary across trials, while the covariates, CATE and ATE remain constant.
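
The last claim can be checked directly from the outcome model. Writing $\tau_t(\bm{z})$ for the CATE in trial $t$ (notation introduced here for illustration),
\begin{align*}
\tau_t(\bm{z}) &= \mathbb{E}[Y(1) \mid \bm{Z}=\bm{z}] - \mathbb{E}[Y(0) \mid \bm{Z}=\bm{z}] \\
&= (\bm{z}\beta_t + 4) - \bm{z}\beta_t = 4,
\end{align*}
so the CATE, and therefore the ATE, equals 4 in every trial regardless of $\beta_t$.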

First, we treat both domains as RCTs, that is, we set the propensity score model to $X\sim \operatorname{Bernoulli}(0.5)$ for all units. The observed outcome is then $Y = X Y(1) + (1-X) Y(0)$ by SUTVA. We randomly select 50 of the 1000 available trials, use each to create one training-test pair, and evaluate the models' generalizability on them. To introduce domain shift, we keep all covariate values identical between the training and test domains, except for $Z_1$, which in the test domain is set to 1.5 times its original value (\Cref{fig:ihdp_shift}). For each training-test pair, we learn the parameters following \Cref{alg:semisynthetic_data}, specifying a Gamma distribution for the marginal causal distribution. We denote the resulting data-generating distributions as $P_{\Theta^{tr}}$ and $P_{\Theta^{te}}$ for the training and test domains, respectively. We sample $N^{tr} = 1000$ training observations from $P_{\Theta^{tr}}$ and $N^{te} = 200$ test observations from $P_{\Theta^{te}}$. The number of bootstraps is set to $N_B = 200$.

\Cref{fig:ihdp_mean} shows boxplots of the p-values for each model, and \Cref{tab:ihdp_percentage} reports the percentage of p-values greater than 0.05 across the $50$ trials. T- and S-engression generalize better than the other methods in this setting. We also give the results of distributional regression testing in \Cref{fig:ihdp_dist}.


\begin{figure}[h]
\vspace{.3in}
\centerline{\includegraphics[width=0.75\linewidth]{ihdp_shift.png}}
\vspace{.3in}
\caption{Density of $Z_1$ of Training and Test Domains.}
\label{fig:ihdp_shift}
\end{figure}
\begin{figure}[t]
\vspace{.3in}
\centerline{\includegraphics[width=0.75\linewidth]{ihdp_mean.png}}
\vspace{.3in}
\caption{$p$-values of Mean Regression Testing of 50 Trials in IHDP.}
\label{fig:ihdp_mean}
\end{figure}

\begin{table}[h]
\caption{Percentage of $p$-values Greater than 0.05 across 50 IHDP Trials.}
\label{tab:ihdp_percentage}
\begin{center}
\begin{tabular}{ccc}
\hline
\textbf{Model} & \textbf{RCT} & \textbf{Non-RCT} \\
\hline
TARNet & 0\% & 0\% \\
CausalForest & 12\% & 6\% \\
S-BART & 12\% & 8\% \\
T-BART & 12\% & 6\% \\
S-engression & 18\% & 6\% \\
T-engression & 24\% & 8\% \\
\hline
\end{tabular}
\end{center}
\end{table}


\begin{figure}[t]
\vspace{.3in}
\centerline{\includegraphics[width=0.75\linewidth]{ihdp_dist_new_edit.png}}
\vspace{.3in}
\caption{$p$-values of Distributional Regression Testing of 50 Trials in IHDP.}
\label{fig:ihdp_dist}
\end{figure}

% We cover two simulation scenarios: Randomized Controlled Trials (RCT) and covariate imbalances across treatment arms by introducing propensity score models. 

While we use the RCT setting above to demonstrate our method, it is also applicable to observational studies. \Cref{tab:ihdp_percentage} also reports, for each algorithm, the percentage of p-values greater than 0.05 across 50 trials when the treatment arms are imbalanced, which we achieve by setting $P(X=1 \mid \bm{Z}) = \operatorname{logit}^{-1}(Z_2+Z_3+Z_4)$ in each trial. Since our paper's focus is on providing a systematic generalizability evaluation method, we omit further analysis here.
% \begin{figure}[t]
% \vspace{.3in}
% \centerline{\includegraphics[width=0.75\linewidth]{ihdp_mean_obs.png}}
% \vspace{.3in}
% \caption{$p$-values of Mean Regression Testing across 50 Iterations, Non-randomized Study.}
% \label{fig:ihdp_mean_obs}
% \end{figure}

Details on hyperparameters and additional experiments, including performance comparisons with or without domain shift when the CATE is known to be linear, are provided in the Supplementary Material.

\section{SUMMARY}

In this paper, we develop a statistical method for evaluating the generalizability of causal inference algorithms using actual application data, facilitated by frugal parameterization. Our approach introduces a semi-synthetic simulation framework that bridges the gap between synthetic simulations and real-world applications, supporting the generalizability evaluation of both mean and distributional regression models. Through flexible, user-defined data generation processes, our framework provides robust statistical testing to assess how well models trained in one domain generalize to shifted domains. 

Through experiments on synthetic data and the IHDP dataset, we assess the generalizability of algorithms such as TARNet, CausalForest, S-/T-BART, and S-/T-engression under domain shift. Our method acts as a valuable diagnostic tool, allowing us to explore how factors like training set size or covariate shifts impact generalizability. These insights can help identify model strengths and weaknesses and inform how causal inference models adapt to different settings.

We remark that rejecting the null hypothesis in our approach indicates that a model does not generalize, but it does not quantify the extent of the failure. A possible extension is to develop a more flexible testing method inspired by equivalence testing \citep{wellek2002testing}. This would assess not just whether a model fails but also by how much, determining whether its performance is significantly worse than a given threshold, and so offer a more nuanced view than traditional hypothesis testing. In this paper, we only consider marginal causal quantities as the validation references, but our framework can easily be adapted to use lower-dimensional CODs as the reference instead.

We hope that this work inspires a more careful consideration of model evaluation, encourages simulations that better reflect real-world conditions, and highlights the importance of stress testing in advancing causal inference methodologies.

\clearpage
% \newpage





% \bibliographystyle{plainnat}


\bibliography{references}
\clearpage
\section*{Checklist}




 \begin{enumerate}


 \item For all models and algorithms presented, check if you include:
 \begin{enumerate}
   \item A clear description of the mathematical setting, assumptions, algorithm, and/or model. [Yes] \textit{We do our utmost to make this clear in our submission.}
   \item An analysis of the properties and complexity (time, space, sample size) of any algorithm. [Yes] \textit{We are explicit about the sample sizes used in the paper, and have no inference algorithms as such to report.}
   \item (Optional) Anonymized source code, with specification of all dependencies, including external libraries. [Yes] \textit{We attach a requirements file to our submitted code.}
 \end{enumerate}


 \item For any theoretical claim, check if you include:
 \begin{enumerate}
   \item Statements of the full set of assumptions of all theoretical results. [Yes] \textit{We make this clear in either the main body or the Supplementary Material.}
   \item Complete proofs of all theoretical results. [Yes] \textit{Relevant proofs are either referenced or added to the Supplementary Material.}
   \item Clear explanations of any assumptions. [Yes] \textit{We tried our best to make them clear.}
 \end{enumerate}


 \item For all figures and tables that present empirical results, check if you include:
 \begin{enumerate}
   \item The code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL). [Yes] \textit{All relevant code is included in our attached code. All external data we use is cited.}
   \item All the training details (e.g., data splits, hyperparameters, how they were chosen). [Yes] \textit{We discuss our fitting process in the Supplementary Materials.}
         \item A clear definition of the specific measure or statistics and error bars (e.g., with respect to the random seed after running experiments multiple times). [Yes] \textit{Done.}
         \item A description of the computing infrastructure used. (e.g., type of GPUs, internal cluster, or cloud provider). [Yes] \textit{We discuss computational requirements.}
 \end{enumerate}

 \item If you are using existing assets (e.g., code, data, models) or curating/releasing new assets, check if you include:
 \begin{enumerate}
   \item Citations of the creator If your work uses existing assets. [Yes] \textit{Cited in Supplementary Material.}
   \item The license information of the assets, if applicable. [Yes]
   \item New assets either in the supplemental material or as a URL, if applicable. [Yes]
   \item Information about consent from data providers/curators. [Yes]
   \item Discussion of sensible content if applicable, e.g., personally identifiable information or offensive content. [Not Applicable] \textit{We don't use sensitive material.}
 \end{enumerate}

 \item If you used crowdsourcing or conducted research with human subjects, check if you include:
 \begin{enumerate}
   \item The full text of instructions given to participants and screenshots. [Not Applicable] \textit{No crowdsourcing or human subjects used.}
   \item Descriptions of potential participant risks, with links to Institutional Review Board (IRB) approvals if applicable. [Not Applicable] \textit{No crowdsourcing or human subjects used.}
   \item The estimated hourly wage paid to participants and the total amount spent on participant compensation. [Not Applicable] \textit{No crowdsourcing or human subjects used.}
 \end{enumerate}

 \end{enumerate}


\end{document}
