\Cref{fig:algo-workflow} provides an overview of our workflow. We begin by defining both a test and a training domain, each with a distribution over the pretreatment covariates and the treatment, allowing for distribution shifts across covariates and treatment allocation. The COD is frugally parameterized with a conditional copula, where the covariates' cumulative distribution functions (CDFs) are derived from the test domain's covariate densities. This ensures that samples from the test dataset follow a \textbf{known, customizable} marginal causal density, $p_{\Yx}$.

The training data is generated from the same COD. However, because the training covariate densities may not match the CDFs used to parameterize the conditional copula, the marginal causal distribution in the training domain is generally not available in closed form. We then learn a model, $\hat{f}(\bm{z},x)$, on the training data. Finally, a statistical test is performed to validate whether the lower-dimensional marginal quantity (such as the ATE or an expected potential outcome) estimated from the model's predictions equals the ground truth in the test domain.


\begin{figure}[h]
\vspace{.3in}
\centerline{\includegraphics[width=1\linewidth]{algorithm.png}}
\vspace{.3in}
\caption{Workflow of the Proposed Method.}
\label{fig:algo-workflow}
\end{figure}

\subsection{Data Simulation}
In this section we describe how to simulate the data.

% This involves first parameterizing terms in the test domain,  defining the marginal causal density and the conditional copula density. We define an invariant across both domains in terms of the marginal causal density and the conditional univariate copula density. Everything else can change.

\subsubsection{Multi-domain Simulation with Frugal Models}
We begin by specifying two data generating processes: the training data, $D^{A} \sim P^{A}_{\ZbXY}$, and the test data, $D^{B} \sim P^{B}_{\ZbXY}$. Our goal is to construct a COD that parameterizes the joint density across both domains, while ensuring that the marginal causal density in domain $B$ is parameterized by $p^{B}_{\Yx}$. The supports of covariates in domains $A$ and $B$ are denoted $\mathcal{Z}^A$, $\mathcal{Z}^B$.

Recall from \Cref{subsec:frugal-params} that a general observational density can be factorized into the \textit{past}, $p_{\bm{Z}X}$, and the COD:
\begin{equation}\label{eq:cod}
    \begin{aligned}
        p&_{\YxIZb}(y \cmid \bm{z}) = p_{\Yx}(y) \times \\ 
        & \qquad c_{\YxIZb}\!\left(F_{\Yx}(y) \mid F_{Z_{1}}(z_1),\dots, F_{Z_{D}}(z_{D}) \right),
    \end{aligned}
\end{equation}
where $F_{\Yx}$ is the CDF associated with the marginal causal density $p_{\Yx}$. 

Note that the copula density in (\ref{eq:cod}) is not only determined by the copula's family and its parameterization, but also by the choice of marginal CDFs for the covariates, $\bm{Z}$. If the conditional copula density is marginalized over the densities corresponding to the covariate CDFs, then the ranks of the marginal causal density will be uniformly distributed:
\begin{equation*}
    p\left(F_{\Yx}\right) = \int d\bm{z}~c_{\YxIZb}(y(x) \cmid \bm{z}) \cdot \prod_{d=1}^{D}p_{Z_{d}}(z_{d}) = 1.
\end{equation*}
This uniformity is guaranteed if the marginal covariate densities $\{ p_{Z_d} \}_{d=1}^{D}$ correspond to the CDFs used to parameterize the copula. Thus, data simulated using our method matches the marginal causal quantity we specify. 
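
As an illustrative special case (a sketch under our own simplifying assumptions, not a requirement of the method), take a single covariate $Z$ and a bivariate Gaussian copula with correlation $\rho$. The symbols $\rho$, $m$, $s$, $U$, $W$ and the standard normal CDF $\Phi$ are introduced only for this example. Writing $U = F_{\Yx}(Y(x))$ for the rank and $W = \Phi^{-1}(F_{Z}(Z))$ for the normal score of the covariate, the conditional copula implies
\begin{equation*}
    \Phi^{-1}(U) \mid W = w \;\sim\; \mathcal{N}\!\left(\rho w,\, 1-\rho^{2}\right).
\end{equation*}
If $Z \sim p_{Z}$, the density corresponding to $F_{Z}$, then $W \sim \mathcal{N}(0,1)$ and
\begin{equation*}
    \Phi^{-1}(U) \sim \mathcal{N}\!\left(0,\, \rho^{2} + 1 - \rho^{2}\right) = \mathcal{N}(0,1),
\end{equation*}
so $U$ is uniform on $[0,1]$. If instead $Z$ is drawn from a shifted density under which $W \sim \mathcal{N}(m, s^{2})$, then
\begin{equation*}
    \Phi^{-1}(U) \sim \mathcal{N}\!\left(\rho m,\, \rho^{2}s^{2} + 1 - \rho^{2}\right),
\end{equation*}
which is standard normal only when $\rho = 0$ or $(m, s^{2}) = (0, 1)$, so the ranks are in general no longer uniform.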
% Generally, if we instead consider a set of alternative marginal densities, $\{p'_{Z_d}\}_{d=1}^{D}$, which are not derived from the CDFs within the copula, i.e. $F_{Z_{d}}(Z_{d} = t) \neq F_{Z'_{d}}(Z'_{d} = t)$ then the rank uniformity is not assured.

% However, . Thus, the COD density is generally valid under any distribution of the past, and will not in guarantee the sampling from the specified marginal causal density if the covariate densities are derived from the CDFs that parameterize the copula. In the Supplementary Material, we present the conditions by which alternative distributions will yield samples drawn from the specified marginal causal density, assuming that the conditional copula density is Gaussian. Given how rarely these conditions are satisfied, we do not believe this will commonly be encountered in semi-synthetic benchmark generation. These conditions will likely be even harder to satisfy if a more complex multivariate copula (such as non-Gaussian vines) is chosen. We refer the reader to the Supplementary Material for further details.

For evaluating generalization, we set the CDFs within the copula density to be derived from the covariate densities in the test domain $P_{\bm{Z}XY}^{B}$. This allows us to construct the COD density across all covariate and treatment spaces:
\begin{equation*}
    \begin{aligned}
        p&_{\YxIZb}\left( y \cmid \bm{z}\right) = p^{B}_{\Yx}\left(y\right) \times \\ 
        & \qquad c^{B}_{\YxIZb}\!\left(F^{B}_{\Yx}(y) \,\middle|\, F_{Z_1^{B}}(z_1), \dots, F_{Z_{D}^{B}}(z_D) \right),
    \end{aligned}
\end{equation*}
which yields samples from a known marginal causal density equal to $p^{B}_{\Yx}$, provided the covariate CDFs in the copula are derived from the test domain covariate densities. 

For two joint distributions with the same marginal covariate densities but different marginal causal densities, their CODs must differ. We can thus evaluate differences between CODs via comparing the lower-dimensional marginal causal densities instead.

This offers a great deal of flexibility in testing method generalizability. One can draw training and test datasets with different covariate densities and propensity scores, while guaranteeing that the CODs remain consistent, and that the test data is drawn from a distribution with a marginal causal density parameterized by $p^{B}_{\Yx}$. However, we note that a key assumption of our testing framework is $\mathcal{Z}^A \subseteq \mathcal{Z}^B$, as evaluating $p_{\YxIZb}$ requires evaluating every marginal covariate CDF, each of which is defined on domain $B$.


\begin{algorithm}[h!]
\caption{Semi-synthetic Data Generation.}
\begin{algorithmic}
% \scriptsize
\vspace*{2pt}
\STATE{\textbf{Input}:~Original test data; original covariates and treatment from training data.}
\vspace*{2pt}
\STATE{\textbf{Parameter estimation on test domain $B$}} 

Estimate the joint covariate-treatment density, $\hat{p}^{B}_{\ZbX}$; marginal causal density, $\hat{p}^{B}_{\Yx}$; conditional copula, $\hat{C}^{B}_{\YxIZb}$.
% \vspace*{2pt}
\STATE{\textbf{Data simulation on domain $B$}}

Sample $(\bm{z}^{B}, x^{B}) \sim \hat{p}^{B}_{\ZbX}$;\\
Sample the causal effect rank $\hat{u}^{B}_{\Yx|\bm{Z}} \sim U[0,1]$;\\
Calculate $y^{B} = {\big(\hat{F}_{\Yx}^{B}\big)^{-1}}\left(\hat{C}^{B}_{\Yx|\bm{Z}}(\hat{u}^{B}_{\Yx|\bm{Z}} \mid \bm{z}^{B})\right)$.

\vspace*{2pt}
\STATE{\textbf{Parameter estimation on training domain $A$}}

Estimate the joint covariate-treatment density, $\hat{p}^{A}_{\bm{Z}X}$.

\vspace*{2pt}
\STATE{\textbf{Data simulation on domain $A$}} 

Sample $(\bm{z}^{A}, x^{A}) \sim \hat{p}^{A}_{\ZbX}$;\\
Sample the causal effect rank $\hat{u}_{\Yx|\bm{Z}}^{A} \sim U[0,1]$;\\
Calculate $y^{A} = {\big(\hat{F}_{\Yx}^{B}\big)^{-1}}\left(\hat{C}^{B}_{\Yx|\bm{Z}}(\hat{u}_{\Yx|\bm{Z}}^{A} \mid \bm{z}^{A})\right)$.

\vspace*{2pt}
\STATE{\textbf{Output}: Training sample $D^{A} = (\bm{z}^{A}, x^{A}, y^{A})$};\\ \hspace*{40pt} Test sample $D^{B} = (\bm{z}^{B}, x^{B}, y^{B})$. 
\end{algorithmic}
\label{alg:semisynthetic_data}
\end{algorithm}

Our primary workflow follows the approach outlined in \Cref{alg:semisynthetic_data}. 
First, we estimate the joint covariate-treatment density of the test data, denoted as $\hat{p}^{B}_{\ZbX}$. We then estimate the marginal causal density $\hat{p}^{B}_{\Yx}$ and the conditional copula $\hat{c}^{B}_{\YxIZb}$, which captures the covariate-outcome dependency conditional on treatment. Given covariate and treatment samples, we draw a uniform rank and use the conditional copula to calculate the causal density rank, $\hat{u}_{\Yx}$. The outcome is then obtained using the inverse transform ${\big(\hat{F}_{\Yx}^{B}\big)^{-1}}$. For the training data, we follow the same steps, except that only the joint covariate-treatment density, $\hat{p}^{A}_{\ZbX}$, is re-estimated; the marginal causal CDF and the conditional copula are reused from the test domain so that both domains share the same COD. 
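
To make the outcome step of \Cref{alg:semisynthetic_data} concrete, consider a hedged sketch with a single covariate $Z$, a Gaussian conditional copula with correlation $\rho$, and a Gaussian marginal causal density $p^{B}_{\Yx} = \mathcal{N}(\mu_x, \sigma^{2})$. All of these choices, and the symbols $\rho$, $\mu_x$, $\sigma$, $w$ and $\Phi$ (the standard normal CDF), are assumptions made only for illustration. We also assume that the conditional copula map applied to the uniform draw $u$ is the conditional quantile function of the copula. With $w = \Phi^{-1}\big(F^{B}_{Z}(z)\big)$, the rank and the outcome are
\begin{equation*}
    \begin{aligned}
        u_{\Yx} &= \Phi\!\left(\rho\, w + \sqrt{1-\rho^{2}}\;\Phi^{-1}(u)\right), \\
        y &= \big(F^{B}_{\Yx}\big)^{-1}\!\left(u_{\Yx}\right) = \mu_x + \sigma\left(\rho\, w + \sqrt{1-\rho^{2}}\;\Phi^{-1}(u)\right).
    \end{aligned}
\end{equation*}
The same map is applied in both domains; only the distribution of $z$, and hence of $w$, differs between the training and test samples.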
% A general summary of how to simulate from this workflow can be found in \Cref{alg:semisynthetic_data}.

% First, we estimate the empirical CDFs of the pretreatment covariates of the test data, denoted as $\hat{F}^{B}_{Z_d},~\forall ~d = \{1,\dots, D\}$. We then estimate the marginal causal density $\hat{p}^{B}_{\Yx}$ and the joint copula $\hat{c}^{B}_{\ZbYx}$, capturing the covariate-outcome dependency conditional on treatment. With the test copula known, we draw samples $\bm{u}_{\bm{Z}}^{B} \sim \hat{c}^{B}_{\ZbYx}$, and use inverse transforms to generate the covariate samples $z_{d}^{B} = \hat{F}_{Z_d}^{B^{-1}}(u_{Z_d}^{B})$. Next, we estimate the propensity score model for the test data, $\hat{p}^{B}_{\XIZb}$ and sample the treatment variable $x^{B} \sim \hat{p}^{B}_{\XIZb}(\cdot \mid \bm{z}^{B})$. The outcome data is calculated using $y^{B} = \hat{F}_{\Yx}^{B^{^{-1}}}(u_{\Yx}^{B})$, where $u_{\Yx}^{B}$ is the sampled outcome rank from the copula. For the training data, we follow a similar approach. A general summary of how to simulate from this workflow can be found in \Cref{alg:semisynthetic_data}. %\Cref{alg:semisynthetic_data}. %With this approach we get the semi-synthetic samples from test domain and training domain.

% First, we estimate the empirical CDFs $\hat{F}^{A}_{Z_d},~\forall ~ d = \{1, \dots, D\}$ and the covariate copula $\hat{c}^{A}_{\bm{Z}}$. We draw samples from this copula, $\hat{\bm{u}}_{\bm{Z}}^{A}$, and perform an inverse transform to generate the actual covariate samples, $\hat{z}_{d}^{A} = \hat{F}_{Z_d}^{A^{-1}}(\hat{u}_{Z_d}^{A})$.

% We then estimate the propensity score model for the training data, $\hat{p}^{A}_{\XIZb}$, and use it to sample the treatment variable, $\hat{x}^{A} \sim \hat{p}^{A}_{\XIZb}(\cdot \mid \hat{\bm{z}}^{A})$. The marginal causal rank for the training data is calculated as $\hat{u}_{\Yx}^{A}$, using the copula $\hat{c}^{B}_{\YxIZb}$ from the test data:
% \begin{align*}
%     \hat{u}_{\Yx}^{A} &\sim \hat{c}^{B}_{\YxIZb}\big( \cdot \mid \\
%     &\hat{F}_{Z_1}^{B}(\hat{F}_{Z_1}^{A^{-1}}(u_{Z_1}^{A})), \dots, \hat{F}_{Z_D}^{B}(\hat{F}_{Z_D}^{A^{-1}}(u_{Z_D}^{A})) \big).
% \end{align*}
% Finally, we perform an inverse transform to obtain the outcome samples for the training data, $\hat{y}^{A} = \hat{F}_{\Yx}^{B^{-1}}(\hat{u}_{\Yx}^{A})$, where we make sure to use the marginal causal distribution parameters and the conditional copula $\hat{c}^{B}_{\YxIZb}$ are derived from the test data to ensure the test and training CODs are identical.

% 3) Sample from copula

% 4) Draw samples from the copula, and inverse CDF transform to draw samples from Z.

% 5) Estimate propensity score

% 6) Sample treatment

% 7) Sample outcome using $y^{(i)} = \hat{F}_{\Yx}^{B}(u^{(i)}_{\Yx} \mid x^{(i)})$

% 8) Do the same for the training data, except make sure that when sampling the quantiles from the sample outcome, $u_{\Yx}^{A} \sim \hat{c}^{B}_{\YxIZb}\left( \cdot \mid u_{Z_1}^{A}, \dots, u_{Z_D}^{A} \right)$

% 9) Sample the training outcome $y^{A} = \hat{F}^{B}_{\Yx}\left( u_{\Yx}^{A} \mid x^{(i)}_{A} \right)$


\subsection{Statistical Testing}

Tests of hypotheses about high-dimensional objects have very little power if we wish to consider a wide range of alternatives. Testing lower-dimensional objects instead can substantially increase the chance of rejection when the null hypothesis fails to hold. Because the frugal parameterization gives us the marginal causal density $p^B_{\Yx}$, we can develop statistical tests on
$\mu^B(x)$ rather than $\mu^B(\bm{z}, x)$ for mean regression models, and on $P^B_{\Yx}$ instead of $P^B_{\YxIZb}$ for distributional regression.
%$\mathcal{H}_0: \mathbb{E}\hat{\mu}(x) = \mu(x)$ instead of $\mathcal{H}_0: \mathbb{E}\hat{\mu}(x,\bm{z}) =  \mu(x,\bm{z})$ for mean regression models, and $\mathcal{H}_0: \hat{P}_{\Yx} = P_{\Yx}$ instead of $\mathcal{H}_0: \hat{P}_{\YxIZb} = P_{\YxIZb}$ for distributional regression.


Our testing algorithms require the following parameters: $N_{btp}$, the number of bootstrap iterations, and $N^{A}$ and $N^{B}$, the numbers of samples simulated from the training and test domains, respectively, in each bootstrap iteration. We provide the mean regression test in \Cref{alg:mean_test_algo}, but our algorithm can be extended to distributional regression models: after applying $\hat{f}$ to $D^{B}_b$, for each $i$, we sample $\{y^j_{ib}\}_{j=1}^{N_Y}$ from the predicted distribution, $\hat{P}_{Y(x_{ib})|\bm{z}_{ib}}$, and estimate marginal causal distributions such as $\hat{P}^{B}_{Y\left(x^0\right)} :=\bigcup_{b=1}^{N_{btp}} \bigcup_{i=1}^{N^{B}} \bigcup_{j=1}^{N_Y} \left\{ y_{ib}^j \mid x_{ib} = x^0 \right\}$. We then conduct a distribution test, e.g.~the Kolmogorov-Smirnov test, of $\mathcal{H}_0:  \hat{P}^{B}_{Y(x^0)}=P^{B}_{Y(x^0)}$ and obtain the p-value.
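
For concreteness, suppose the one-sample Kolmogorov-Smirnov test is used; this is an assumption for this sketch, as is the notation $M$, $y_m$, $\hat{F}_{M}$ and $D_{M}$. Let $y_1, \dots, y_M$ denote the $M$ pooled samples in $\hat{P}^{B}_{Y(x^0)}$ and let $F^{B}_{Y(x^0)}$ be the reference CDF. The test statistic is then
\begin{equation*}
    D_{M} = \sup_{y}\,\left| \hat{F}_{M}(y) - F^{B}_{Y(x^0)}(y) \right|,
    \qquad
    \hat{F}_{M}(y) = \frac{1}{M}\sum_{m=1}^{M} \mathbb{1}\{y_m \leq y\},
\end{equation*}
with large values of $D_{M}$ providing evidence against $\mathcal{H}_0$.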

Our testing algorithm is flexible in the choice of testing reference, e.g.~in \Cref{alg:mean_test_algo}, we can replace $\mu^{B}(x)$ with $\tau^{B}$ as the reference target when $X$ is binary, which is what we used in our experiments. The testing method used for distributional regression models can also be replaced by other statistical tests, such as the Maximum Mean Discrepancy Test \citep{gretton2012kernel} or the Cramér-von Mises Test \citep{anderson1962distribution}.


%for distributional testing, we also need to specify $N_Y$, which is the number of outcome samples simulated from distributional regression output for each $\hat{f}(x,\bm{z})$. 

% We provide testing methods for two types of regression models: mean regression in  \Cref{alg:mean_test_algo} or distributional regression in  \Cref{alg:dist_test_algo}. Note that, in \Cref{alg:mean_test_algo}, we can replace $\mu^{B}(x)$ with $\tau^{B}$ as the reference target when $X$ is binary, which is what we used in our experiments. 



\begin{algorithm}[t]
\caption{Generalizability Evaluation on Mean Regression Models.}
\begin{algorithmic}%[1] % this prints line numbers
% \scriptsize
\vspace*{2pt}
\STATE{\textbf{Input}:~~~~$\Theta^{A}$: parameters for training domain,\\
% \hspace*{34.3pt} $B$: number of bootstrap samples,\\
% % \hspace*{34.3pt} $\delta$: confidence level,\\
% \hspace*{34.3pt} $N^{B}$: number of $(X,Z)$ samples simulated for each bootstrap,\\
\hspace*{34.3pt} $\Theta^{B}$: parameters for test domain,\\
\hspace*{34.3pt} $\mu^{B}(x^0)$: reference.}
\vspace*{3pt}
\FOR{$b=1, \ldots, {N_{btp}}$}
    \STATE{Draw $D_b^{A}:= \{(\bm{z}'_{ib}, x'_{ib}, y'_{ib})\}_{i=1}^{N^{A}} \sim P_{\Theta^{A}}$};
    \STATE{Fit the regression model, $\hat{f}$, on $D_b^{A}$};
    \STATE{Draw $D_b^{B}:= \{(\bm{z}_{ib}, x_{ib})\}_{i=1}^{N^{B}} \sim P_{\Theta^{B}}$};
    \STATE{Apply $\hat{f}$ on $D_b^{B}$ to get predictions $\{\hat{f}(\bm{z}_{ib}, x_{ib})\}_{i = 1}^{N^{B}}$}; 
    \STATE{Calculate 
    $$\hat{\mu}_b^{B}(x^0) = \frac{\sum_{i=1}^{N^{B}} \mathbb{1}\{x_{ib}=x^0\}\hat{f}(\bm{z}_{ib}, x_{ib})}{\sum_{i=1}^{N^{B}}\mathbb{1}\{x_{ib}=x^0\}}.$$}
    % $$\hat{\mu}_b^{B}(x^0) = \frac{1}{\sum_{i=1}^{N^{B}}\mathbb{1}(x_{ib}=x^0)}\sum_{i=1}^{N^{B}} \mathbb{1}(x_{ib}=x^0)\hat{f}(x_{ib},\bm{z}_{ib})$$}.
\ENDFOR
\STATE{\textbf{end for}}
\vspace*{3pt}
% \STATE{Denote $l^{B}$, $u^{B}$ as the $(1-\delta)/2$ and $1-(1-\delta)/2$ quantiles of $\{\hat{\mu}_b^{B}(c)\}_{b=1}^B$}.
% \IF{$\mu^{B}\in \left[l^{B}, u^{B}\right]$}
\STATE{Get the p-value, $p$, by conducting a one-sample t-test of whether the mean of $\{\hat{\mu}_b^{B}(x^0)\}_{b=1}^{N_{btp}}$ equals the target parameter $\mu^{B}(x^0)$}.
% \IF{$\mu^{B}\in \left[l^{B}, u^{B}\right]$}
% \STATE{\textbf{Return} True.}
% \ELSE
% \STATE{\textbf{Return} False.}
% \ENDIF
\STATE{\textbf{Return} $p$.}
\vspace*{3pt}
\end{algorithmic}
\label{alg:mean_test_algo}
\end{algorithm}
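
Assuming the final step of \Cref{alg:mean_test_algo} is a one-sample, two-sided t-test (our reading; the notation $\bar{\mu}^{B}$, $s_{\mu}$ and $t$ is introduced only for this sketch), the test statistic can be written as
\begin{equation*}
    \begin{aligned}
        \bar{\mu}^{B}(x^0) &= \frac{1}{N_{btp}} \sum_{b=1}^{N_{btp}} \hat{\mu}_b^{B}(x^0), \qquad
        s_{\mu}^{2} = \frac{1}{N_{btp}-1} \sum_{b=1}^{N_{btp}} \left( \hat{\mu}_b^{B}(x^0) - \bar{\mu}^{B}(x^0) \right)^{2}, \\
        t &= \frac{\bar{\mu}^{B}(x^0) - \mu^{B}(x^0)}{s_{\mu} / \sqrt{N_{btp}}},
    \end{aligned}
\end{equation*}
and the p-value is obtained by comparing $t$ with a Student $t$ distribution with $N_{btp}-1$ degrees of freedom.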

% \begin{algorithm}[t]
% \caption{Generalizability Evaluation on Distributional Regression Models.}
% \begin{algorithmic}%[1] % this prints line numbers
% \vspace*{2pt}
% \STATE{\textbf{Input}:~~$\hat{f}$: fitted distributional regression model,\\
% \hspace*{34.3pt} $B$: number of bootstrap samples,\\
% % \hspace*{34.3pt} $\alpha$: significance level,\\
% \hspace*{34.3pt} $N^{B}$: number of $(X,Z)$ samples generated in each bootstrap,\\
% \hspace*{34.3pt} $N_Y$: number of $Y$ samples simulated from distributional regression output $\hat{f}(X,Z)$,\\
% \hspace*{34.3pt} $\Theta^{B}$: parameters for test domain,\\
% \hspace*{34.3pt} $\mathbb{P}(Y|do(X=c))$: reference.}
% \vspace*{3pt}
% \FOR{$b=1, \ldots, B$}
%     \STATE{Draw sample data $D_b^{B}:= \{(X_{ib},Z_{ib})\}_{i=1}^{N^{B}} \sim P_{\Theta^{B}}$};
%     \STATE{Apply $\hat{f}$ on $D_b^{B}$ to get distributional predictions $\hat{\mathbb{P}}\left(Y|X_{ib}, Z_{ib}\right)$};
%     \STATE{For each $i$, sample $\{Y^j_{ib}\}_{j=1}^{N_Y}$ from $\hat{\mathbb{P}}(Y|X_{ib}, Z_{ib})$}.
% \ENDFOR
% \STATE{\textbf{end for}}
% \vspace*{3pt}
% \STATE{Estimate $ \smash{\hat{P}(Y \mid do(X) = c) = \bigcup_{b=1}^{B} \bigcup_{i=1}^{N^{B}} \bigcup_{j=1}^{N_Y} \left\{ Y_{ib}^j \mid X_{ib} = c \right\}}$.}
% \STATE{Conduct distribution tests, e.g., the Kolmogorov-Smirnov test, to evaluate $\mathcal{H}_0:\hat{P}(Y \mid do(X) = c) =P(Y \mid do(X) = c)$ and get p-value $p$.}
% % \IF{$p>\alpha$}
% % \STATE{\textbf{Return} True.}
% % \ELSE
% % \STATE{\textbf{Return} False.}
% % \ENDIF
% \STATE{\textbf{Return} $p$.}
% \vspace*{3pt}
% \end{algorithmic}
% \label{alg:dist_test_algo}
% \end{algorithm}
% \begin{algorithm}[t]
% \caption{Generalizability Evaluation on Distributional Regression Models.}
% \begin{algorithmic}%[1] % this prints line numbers
% % \scriptsize
% \vspace*{2pt}
% \STATE{\textbf{Input}:~~$\Theta^{A}$: parameters for training domain,\\
% % \hspace*{34.3pt} $B$: number of bootstrap samples,\\
% % % \hspace*{34.3pt} $\alpha$: significance level,\\
% % \hspace*{34.3pt} $N^{B}$: number of $(X,Z)$ samples simulated in each bootstrap,\\
% % \hspace*{34.3pt} $N_Y$: number of $Y$ samples simulated from distributional regression output $\hat{f}(X,Z)$,\\
% \hspace*{34.3pt} $\Theta^{B}$: parameters for test domain,\\
% \hspace*{34.3pt} $P^{B}_{Y(x^0)}$: reference.} \\
% % \hspace*{34.3pt} $P_{\Yx}^{B}(\cdot \mid x)$: reference.}
% \vspace*{3pt}
% \FOR{$b=1, \ldots, N_{btp}$}
%     \STATE{Sample $D_b^{A}:= \{(\bm{z}'_{ib}, x'_{ib},y'_{ib})\}_{i=1}^{N^{A}} \sim P_{\Theta^{A}}$}; 
%     \STATE{Fit the distributional regression model, $\hat{f}$, on $D_b^{A}$};
%     \STATE{Sample $D_b^{B}:= \{\bm{z}_{ib}, x_{ib})\}_{i=1}^{N^{B}} \sim P_{\Theta^{B}}$}; 
%     \STATE{Apply $\hat{f}$ on $D_b^{B}$ to get distributional predictions $\hat{P}_{Y(x_{ib})|z_{ib}}$};
%     \STATE{For each $i$, sample $\{y^j_{ib}\}_{j=1}^{N_Y}$ from $\hat{P}_{Y(x_{ib})|z_{ib}}$}.
% \ENDFOR
% \STATE{\textbf{end for}}
% \vspace*{3pt}
% \STATE{Estimate $\hat{P}^{B}_{Y\left(x^0\right)} =$}
% % \STATE{\hspace{1em} $\hat{P}(Y \mid do(X) = c) =$}
% \STATE{\hspace{2em} $\bigcup_{b=1}^{N_{btp}} \bigcup_{i=1}^{N^{B}} \bigcup_{j=1}^{N_Y} \left\{ y_{ib}^j \mid x_{ib} = x^0 \right\}$}.
% \STATE{Conduct distribution tests, e.g.~the Kolmogorov-Smirnov test, for $\mathcal{H}_0:  \hat{P}^{B}_{Y(x^0)}=P^{B}_{Y(x^0)}$ and get the p-value $p$.}
% \STATE{\textbf{Return} $p$.}
% \vspace*{3pt}
% \end{algorithmic}
% \label{alg:dist_test_algo}
% \end{algorithm}



A summary of this workflow is presented in \Cref{fig:algo-workflow}.