
In this section, we use our workflow to evaluate the generalizability of a range of modern causal models.


As discussed in several review papers, including \citet{curth2021really}, \citet{ling2022critical}, and \citet{kiriakidou2022evaluation}, methods such as meta-learners (e.g.~T- and S-learners) \citep{kunzel2019metalearners}, CausalForest \citep{wager2018estimation}, TARNet \citep{shalit2017estimating}, and BART \citep{chipman2010bart} are widely used for CATE estimation, each offering advantages in different scenarios. Our evaluation focuses on the accuracy of their CATE estimates under covariate distribution shift. Further details about these models can be found in \Cref{sec:models}.


Another algorithm of interest is engression, introduced by \citet{shen2023engression}. It approximates the conditional distribution using a pre-additive noise model. Because it targets distributional regression, the model can extrapolate to unseen or underrepresented data points through its learned non-linear transformations. The key factors affecting engression's generalizability are the distance between the two domains and whether the true underlying function is strictly monotonic in the extrapolation region. In our experiments, we evaluate engression in both the S-learner and T-learner settings.

\subsection{Synthetic Data}
\label{sec:synthetic}
We first conduct experiments on synthetic data to demonstrate and validate our method. Although our approach can handle various data types and is particularly effective with high-dimensional covariates and continuous treatment interventions, for clarity this simple example uses two continuous confounders, $Z_1$ and $Z_2$, sampled from identical Gamma distributions, and a binary intervention $X$. We initially assume that both datasets come from randomized controlled trials (RCTs), so that $X \sim \operatorname{Bernoulli}(0.5)$ under both $P^A$ and $P^B$. We parameterize the Gaussian copula, $c_{\ZbYx}$, with Spearman correlation coefficients $\rho_{Z_1 Z_2} = 0$, $\rho_{Z_1\Yx} = 0.1$ and $\rho_{Z_2\Yx} = 0.9$. The distribution of $\Yx$ is defined as $\mathcal{N}(2x+1,1)$ in the test domain. For the simulation, we generate $N^{A} = 200$ training samples and $N^{B} = 50$ test samples per bootstrap, with $N_{btp}=200$ bootstraps in total, and repeat this process for 50 iterations. In the training domain, the marginal distributions of $Z_1$ and $Z_2$ are identical Gamma distributions with shape $k=1$ and rate $\theta=1$.
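
As a compact summary of the setup above, and as a quick check of the causal quantity the models are asked to recover, the stated ingredients for the training domain can be collected as follows. The only added step is the arithmetic for the average treatment effect, which follows directly from the mean $2x+1$ of $\Yx$; the operator $\operatorname{E}$ denotes expectation and is notation introduced here for illustration.
\[
Z_1, Z_2 \sim \operatorname{Gamma}(k=1,\ \theta=1), \qquad X \sim \operatorname{Bernoulli}(0.5), \qquad \Yx \sim \mathcal{N}(2x+1,\ 1),
\]
\[
\operatorname{E}[Y\mspace{-1mu}(1)] - \operatorname{E}[Y\mspace{-1mu}(0)] = (2\cdot 1 + 1) - (2\cdot 0 + 1) = 2 .
\]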

We examine two settings. In Setting 1, the test domain has a slight covariate shift, with $Z_1$ and $Z_2$ following a Gamma distribution with $k=2$, $\theta=1$. In Setting 2, the shift is larger ($k=4$, $\theta=1$). Despite these shifts, the COD remains the same owing to the frugal parameterization, as shown in \Cref{fig:synthetic}.
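
To make the size of the two shifts concrete, recall that a Gamma variable with shape $k$ and rate $\theta$ has mean $k/\theta$ and variance $k/\theta^{2}$. Since $\theta=1$ throughout, the same values result whether $\theta$ is read as a rate or as a scale. Writing $Z_j^{A}$ and $Z_j^{B}$ for a confounder in the training and test domains (superscripts introduced here only for illustration), the covariate means are
\[
\operatorname{E}[Z_j^{A}] = 1, \qquad
\operatorname{E}[Z_j^{B}] = 2 \ \ (\mbox{Setting 1}), \qquad
\operatorname{E}[Z_j^{B}] = 4 \ \ (\mbox{Setting 2}), \qquad j \in \{1,2\},
\]
and the variances take the same values $1$, $2$ and $4$, respectively. Relative to the training domain, the covariate mean therefore doubles in Setting 1 and quadruples in Setting 2.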


\begin{figure}[h]
\vspace{.3in}
\centerline{\includegraphics[width=1\linewidth]{synthetic_group.png}}
\vspace{.3in}
\caption{Synthetic Data Generated from Setting 1 (Top) and Setting 2 (Bottom). }
\label{fig:synthetic}
\end{figure}

The $p$-values in \Cref{fig:synthetic_mean_p} illustrate the differences across models. As expected, with the larger domain shift in Setting 2, models have greater difficulty generalizing, as reflected by generally smaller $p$-values than in Setting 1. T-BART and T-engression generalize well in this specific setting, with $p$-values that are uniformly distributed. TARNet struggles, likely due to the complexity of its representation-learning network design and hyperparameter tuning.
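
The link between uniformly distributed $p$-values and good generalization is the usual calibration property of a valid test. As a brief reminder, and assuming a continuous test statistic and a null hypothesis under which the model's predictions agree with the target quantity in the test domain (our reading of the null, stated here as an assumption), the $p$-value satisfies
\[
\Pr(p \le \alpha) = \alpha, \qquad \alpha \in [0,1].
\]
Under this idealization, about $95\%$ of $p$-values would exceed $0.05$, that is, roughly $0.95 \times 50 = 47.5$ of the $50$ iterations. A concentration of $p$-values near zero therefore indicates a systematic departure from the null rather than sampling noise.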

\begin{figure}[h]
\vspace{.3in}
\centerline{\includegraphics[width=1\linewidth]{synthetic_mean_p.png}}
\vspace{.3in}
\caption{$p$-values of Mean Regression Testing, Synthetic Data of 50 Iterations.}
\label{fig:synthetic_mean_p}
\end{figure}

Our method also allows us to test the generalizability of distributional regression. \Cref{fig:synthetic_distribution_p} shows the $p$-values of distributional regression testing of S-engression under the two settings, with $N_Y=50$. As expected, since the covariate distribution shift in Setting 1 is smaller, S-engression generalizes better than in Setting 2.

\begin{figure}[h]
\vspace{.3in}
\centerline{\includegraphics[width=1\linewidth]{synthetic_distribution_p_new.png}}
\vspace{.3in}
\caption{$p$-values of Distributional Regression Testing (Kolmogorov--Smirnov Test) of S-engression, Synthetic Data of 50 Iterations.}
\label{fig:synthetic_distribution_p}
\end{figure}

Supported by flexible simulations based on actual data, our method is useful for stress testing and model diagnostics. \Cref{fig:varying_n} shows how varying the training set size affects the generalizability of T-BART and T-engression; performance worsens once $N^{A}$ exceeds 100. This may stem from problems such as overfitting, but solving them is not our focus. Rather, our method serves as a tool to detect and highlight potential issues when making predictions on real data, which the frugal parameterization makes feasible by enabling simulation based on actual data. We also remark on the difference in performance between S- and T-learners. In CATE estimation, T-learners fit a separate model for each treatment group, while S-learners fit a single model across both groups, with treatment included as a feature. T-learners therefore offer greater flexibility for modeling patient heterogeneity, and it is unsurprising that they consistently outperform S-learners in our experiments.
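
The distinction between the two meta-learners can be written compactly using their standard definitions. The symbols $\hat{\mu}$, $\hat{\mu}_x$ and $\hat{\tau}$ are introduced here only for illustration: $\hat{\mu}(z,x)$ denotes a single regression of the outcome on covariates and treatment fitted to all units, while $\hat{\mu}_x(z)$ denotes a regression fitted only to units with treatment $X=x$. The resulting CATE estimates are
\[
\hat{\tau}_{S}(z) = \hat{\mu}(z,1) - \hat{\mu}(z,0),
\qquad
\hat{\tau}_{T}(z) = \hat{\mu}_{1}(z) - \hat{\mu}_{0}(z).
\]
In the S-learner, treatment is one feature among many and a regularized base learner may shrink its contribution, whereas the T-learner places no shared structure on the two response surfaces.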

\begin{figure}[t]
\vspace{.3in}
\centerline{\includegraphics[width=1\linewidth]{varying_n_train.png}}
\vspace{.3in}
\caption{$p$-values of Mean Regression Testing of 50 Iterations, Varying $N^{A}$, Setting 2, Synthetic Data.}
\label{fig:varying_n}
\end{figure}

Note that the extrapolation performance of models like engression is typically evaluated visually, one dimension at a time. Our method, by contrast, provides a statistical evaluation of extrapolation performance with high-dimensional covariates.

% Only Gaussian copulas with a fully connected dependency structure were used in our experiments of this section. However, this framework can be generalized to pair copula constructions that allow for modeling non-Gaussian copulas with more complex dependency structures and with a range of higher dimensional covariates. Further details for each of these cases can be found in the Additional Experiments of Supplementary Material, investigating how algorithms' generalizability varies with different dependency structure.

\subsection{Real Data}
\label{sec:IHDP}
We evaluate algorithm generalizability using the Infant Health and Development Program (IHDP) dataset, a randomized experiment conducted between 1985 and 1988 to study the effect of home visits on infants' cognitive test scores~\citep{hill2011bayesian}. This dataset has become widely used in domain adaptation research \citep{curth2021really,shi2021invariant}. 

In this section, we build on the experiments of \citet{johansson2018learning}, who train a range of causal ML algorithms on IHDP data and measure in-domain predictive performance using MSE. We extend these experiments by showing how our validation framework can be used to test out-of-domain predictive performance. Specifically, we compare the MSE metric against the $p$-values obtained via our proposed testing framework, highlighting how our method provides a more informative measure of whether a model generalizes robustly across domains.


The IHDP dataset contains $T=1000$ trials, each consisting of the same 747 subjects and 25 pretreatment covariates, of which the first six are continuous and the rest binary. The potential outcomes $Y\mspace{-1mu}(1)$ and $Y\mspace{-1mu}(0)$ are provided in the data. In each trial $t$, $Y\mspace{-1mu}(x) \sim \mathcal{N}(\bm{Z}\beta_t + 4x, \, 1)$, where the entries of $\beta_t$ are randomly chosen from the values $(0, 1, 2, 3, 4)$ with probabilities $(0.5, 0.2, 0.15, 0.1, 0.05)$. Thus, the potential outcomes vary across trials, while the covariates, CATE and ATE remain constant.

First, we treat both domains as RCTs, that is, we set the propensity score model to $X\sim \operatorname{Bernoulli}(0.5)$ for all units. The observed outcome is then $Y = X Y\mspace{-1mu}(1) + (1-X) Y\mspace{-1mu}(0)$ by consistency. We randomly select 50 of the 1000 available trials, use each to create one training-test pair, and evaluate each model's generalizability on these pairs. To introduce domain shift, we keep all covariate values identical between the training and test domains, except for $Z_1$, which is multiplied by 1.5 in the test domain (see \Cref{fig:ihdp_shift}). For each training-test pair, we learn the parameters following \Cref{alg:semisynthetic_data}, specifying the marginal causal distribution to be a Gamma distribution with parameters estimated from the IHDP data by fitting a generalized linear model. We denote the resulting data-generating distributions by $P_{\Theta^{A}}$ and $P_{\Theta^{B}}$ for the training and test domains, respectively. We sample $N^{A} = 1000$ training points from $P_{\Theta^{A}}$ and $N^{B} = 200$ test points from $P_{\Theta^{B}}$. The number of bootstraps is set to $N_{btp} = 200$. Note that in our experiments, the outcomes were shifted to ensure they are strictly positive, allowing us to use the parametric form of the Gamma distribution to obtain an explicit expression for the mean.
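
To make the induced covariate shift concrete, write $Z_1^{A}$ and $Z_1^{B}$ for the first covariate in the training and test domains (superscripted notation introduced here only for illustration). The rescaling $Z_1^{B} = 1.5\, Z_1^{A}$ then implies
\[
\operatorname{E}[Z_1^{B}] = 1.5\, \operatorname{E}[Z_1^{A}], \qquad
\operatorname{Var}(Z_1^{B}) = 2.25\, \operatorname{Var}(Z_1^{A}), \qquad
p_{Z_1^{B}}(z) = \frac{1}{1.5}\, p_{Z_1^{A}}\!\left(\frac{z}{1.5}\right),
\]
where the density relation assumes that $Z_1$ admits a density, consistent with it being one of the continuous covariates. The test domain thus stretches the support of $Z_1$ by a factor of $1.5$, so part of the test data lies beyond the range observed in training.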

\Cref{fig:ihdp_mean} shows boxplots of the $\log_{10}$($p$-values) for each model, and \Cref{tab:ihdp_percentage} reports the percentage of $p$-values greater than 0.05 across the $50$ trials. T- and S-engression generalize best among these methods in this setting. We also give the results of distributional regression testing in \Cref{fig:ihdp_dist}.


\begin{figure}[h]
\vspace{.3in}
\centerline{\includegraphics[width=1\linewidth]{ihdp_shift.png}}
\vspace{.3in}
\caption{Density of $Z_1$ of Training and Test Domains.}
\label{fig:ihdp_shift}
\end{figure}
\begin{figure}[t]
\vspace{.3in}
\centerline{\includegraphics[width=1\linewidth]{ihdp_mean_log.png}}
\vspace{.3in}
\caption{$\operatorname{log}_{10}(p\text{-values})$ of Mean Regression Testing of 50 Trials in IHDP.}
\label{fig:ihdp_mean}
\end{figure}

\begin{table}[h]
\begin{center}
\begin{tabular}{lrr}
\toprule
\textbf{Model} & \textbf{RCT} & \textbf{Non-RCT} \\
\midrule
TARNet & 0\% & 0\% \\
CausalForest & 12\% & 6\% \\
S-BART & 12\% & 8\% \\
T-BART & 12\% & 6\% \\
S-engression & 18\% & 6\% \\
T-engression & 24\% & 8\% \\
\bottomrule
\end{tabular}
\end{center}

\caption{Percentage of $p > 0.05$ across 50 Trials.} 
\label{tab:ihdp_percentage}


\end{table}


\begin{figure}[t]
\vspace{.3in}
\centerline{\includegraphics[width=1\linewidth]{ihdp_dist_new_edit.png}}
\vspace{.3in}
\caption{$p$-values of Distributional Regression Testing of 50 Trials in IHDP.}
\label{fig:ihdp_dist}
\end{figure}

% We cover two simulation scenarios: Randomized Controlled Trials (RCT) and covariate imbalances across treatment arms by introducing propensity score models. 

While we use the RCT setting above to demonstrate our method, it is also applicable to observational studies. In a non-randomized setting, where we imbalance the treatment arms by setting $P(X=1 \mid Z) = \operatorname{expit}(Z_2+Z_3+Z_4)$, the percentage of $p$-values greater than 0.05 across 50 trials for each algorithm is shown in \Cref{tab:ihdp_percentage}. Since our focus is on providing a systematic method for evaluating generalizability, we omit further analysis here.
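
For concreteness, the propensity model above is read here as a logistic model: the linear predictor $Z_2+Z_3+Z_4$ is mapped to a probability through the inverse of the logit function. This is an assumption about the intended link, made because a probability must lie in $[0,1]$. Written explicitly,
\[
P(X=1 \mid Z=z) = \frac{1}{1+\exp\{-(z_2+z_3+z_4)\}} .
\]
For example, a unit with $z_2+z_3+z_4=0$ is treated with probability $0.5$, and the treatment probability increases monotonically in this sum, which is what produces the covariate imbalance between treatment arms.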

Although we report only these percentages, the full set of $p$-values, including their distribution, is highly informative. We provide guidance on interpreting the test results in \Cref{sec:read_p}.
% \begin{figure}[t]
% \vspace{.3in}
% \centerline{\includegraphics[width=1\linewidth]{ihdp_mean_obs.png}}
% \vspace{.3in}
% \caption{$p$-values of Mean Regression Testing across 50 Iterations, Non-randomized Study.}
% \label{fig:ihdp_mean_obs}
% \end{figure}



Note that this framework of constructing statistical tests on marginal quantities is not restricted to out-of-domain generalization testing. We adapt the original experiments of \citet{johansson2018learning}, in which in-domain model performance was evaluated on IHDP data, and show how our framework can be readily adapted to in-domain performance evaluation. Since our method was designed for out-of-domain generalizability assessment, we defer a detailed discussion to \Cref{sec:indomain}.

Details on hyperparameters and additional experiments, including performance comparisons with or without domain shift when the CATE is known to be linear, are provided in \Cref{sec:computation_details} and \Cref{sec:linear}. 