Consider a static treatment model with an outcome $Y \in \mathcal{Y}\subseteq \mathbb{R}$ and a general treatment $X$, which can be either continuous or discrete. Throughout the paper, we make the standard causal assumptions of consistency, positivity, and conditional ignorability outlined in \citet{pearl2009causality}.
Let the set of measured pretreatment covariates be $\bm{Z} \in \mathcal{Z}\subseteq \mathbb{R}^{D}$. 
We then define the marginal \textit{causal} treatment density as
% 
\begin{equation*}
    p_{\Yx}(y(x)) = \int p_{\YIZbX}(y \cmid \bm{z}, x) ~ p_{\bm{Z}}(\bm{z})~d\bm{z};
\end{equation*} 
this is marginalized over the covariates.  Here $\Yx$ is the \emph{potential outcome} for $Y$ given that $X$ is set to a value $x$.
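
As a simple illustration (a sketch assuming a single binary covariate $Z \in \{0,1\}$, which is not a restriction made elsewhere in the paper), the integral reduces to a two-term sum,
\begin{equation*}
    p_{\Yx}(y(x)) = p_{Y \mid Z, X}(y \mid 0, x)\, P(Z=0) + p_{Y \mid Z, X}(y \mid 1, x)\, P(Z=1).
\end{equation*}
The weights are the marginal covariate probabilities $P(Z=z)$ rather than the conditional probabilities $P(Z=z \mid X=x)$; using the latter would instead recover the observational conditional density $p_{Y \mid X}(y \mid x)$, which generally differs from the causal density when $Z$ affects treatment assignment.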

% We distinguish between the marginal \textit{conditional} treatment density which is the marginalization over the observational dataset:
% \begin{equation}
%     p_{\YIX} = \int p_{\YxIZb} ~ p_{\ZbIX}~d\bm{z},
% \end{equation}
% and the marginal \textit{causal} treatment density:
% \begin{equation}
%     p_{\Yx} = \int p_{\YxIZb} ~ p_{\bm{Z}}~d\bm{z}.
% \end{equation} which is the marginal from the randomized model.

We also use $\mu(x) = \mathbb{E}\, \Yx$ to denote the expected outcome under an intervention that sets $\{X=x\}$, and $\mu(x,\bm{z}) = \mathbb{E}\left[\Yx\mid \bm{Z}=\bm{z}\right]$ to denote the conditional expectation given covariate values. Note that $\Yx$ is essentially equivalent to $Y \mid \text{do}(X=x)$ in the notation of \citet{pearl2009causality}. When the treatment is binary, we define $\tau = \mathbb{E}[Y(1) -Y(0)]$ as the average treatment effect (ATE), which quantifies the overall impact of a treatment change across the entire population. Similarly, let $\tau(\bm{z}) = \mathbb{E}[Y(1) -Y(0)\mid \bm{Z}=\bm{z}]$ be the conditional average treatment effect (CATE), which gives the effect for specific subgroups or individuals and therefore captures treatment effect heterogeneity.
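
These quantities are linked by marginalizing over the covariates. As a brief worked identity (following directly from the definitions above and the law of iterated expectations),
\begin{equation*}
    \mu(x) = \int \mu(x,\bm{z})\, p_{\bm{Z}}(\bm{z})\, d\bm{z}, \qquad
    \tau(\bm{z}) = \mu(1,\bm{z}) - \mu(0,\bm{z}), \qquad
    \tau = \mathbb{E}\left[\tau(\bm{Z})\right] = \mu(1) - \mu(0),
\end{equation*}
where the last two expressions apply to the binary treatment case. The ATE is therefore a covariate-weighted average of the CATE, so it depends on the covariate distribution even when the CATE itself is unchanged.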

Denote the probability measures in domain A and domain B by $P^A$ and $P^B$, respectively. Since our scenario requires that the conditional outcome distributions are the same, we have $P^A_{\Yx\mid\bm{Z}}=P^B_{\Yx\mid\bm{Z}}$; however, because the covariate and treatment distributions may differ, the corresponding equality between the \emph{marginal} causal distributions does not necessarily hold.
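
To see why, write $p^A_{\bm{Z}}$ and $p^B_{\bm{Z}}$ for the covariate densities in the two domains (notation introduced here for illustration) and $p_{\Yx\mid\bm{Z}}$ for the shared conditional density. Then, for $k \in \{A, B\}$,
\begin{equation*}
    p^{k}_{\Yx}(y) = \int p_{\Yx\mid\bm{Z}}(y \mid \bm{z})\, p^{k}_{\bm{Z}}(\bm{z})\, d\bm{z},
    \qquad\text{so}\qquad
    p^{A}_{\Yx}(y) - p^{B}_{\Yx}(y) = \int p_{\Yx\mid\bm{Z}}(y \mid \bm{z}) \left\{ p^{A}_{\bm{Z}}(\bm{z}) - p^{B}_{\bm{Z}}(\bm{z}) \right\} d\bm{z}.
\end{equation*}
The difference on the right is nonzero in general whenever the covariate densities differ and the conditional density depends on $\bm{z}$.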

We aim to evaluate the generalizability of an outcome regression model $\hat{f}(\bm{z},x)$ that predicts the expected value of the outcome $Y$. Predicted outcomes are denoted by $\hat{y} :=\hat{f}(\bm{z},x)$.
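
For a binary treatment, one common way to turn such a regression into effect estimates (a sketch of a standard plug-in construction, not necessarily the estimator used by any particular method discussed below) is
\begin{equation*}
    \hat{\tau}(\bm{z}) = \hat{f}(\bm{z},1) - \hat{f}(\bm{z},0), \qquad
    \hat{\mu}(x) = \frac{1}{n}\sum_{i=1}^{n} \hat{f}(\bm{z}_i, x),
\end{equation*}
where $\bm{z}_1,\dots,\bm{z}_n$ are covariate samples from the domain of interest; $n$, $\hat{\tau}$ and $\hat{\mu}$ are notation introduced here for illustration. Averaging over samples from domain B rather than domain A changes $\hat{\mu}(x)$ even when $\hat{f}$ is held fixed.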

\subsection{Generalizability in Causal Inference}
\label{sec:generalizability_in_causal_inference}
Extensive research has focused on generalizability in causal inference, as mentioned in the introduction. 
% Recently, combining Randomized Controlled Trials (RCT) data with observational data has shown promise for improving CATE estimations in real-world settings. Calibrating outcome models with observational data helps models trained on RCTs better generalize to diverse populations \citep{curth2021really}.
As highlighted by \citet{ling2022critical}, three common approaches are used to assess treatment effect generalizability: inverse probability of sampling weighting (IPSW) methods, which adjust for differences between study and target populations by weighting based on sample inclusion probabilities \citep{buchanan2018generalizing}; outcome models, which estimate the conditional outcome directly \citep{kern2016assessing}; and hybrid approaches that combine both \citep{dahabreh2019generalizing}.

In this paper, we focus on algorithms that generalize conditional outcome predictions across different domains, enabling accurate estimation of the CATE or of the conditional outcome distribution (COD). This is crucial for understanding individual-level treatment effect heterogeneity and ensuring that models can adapt to new populations or environments with varying covariate distributions. A summary of common CATE estimation methods is provided by \citet{caron2022estimating}.

% Despite advancements in CATE estimation, a systematic framework for evaluating generalizability is still underdeveloped. Commonly current methods, like MSE and Precision in Estimation of Heterogeneous Effect (PEHE), provide limited real-world insights \citep{curth2021really,kiriakidou2022evaluation}. 

Despite advancements in CATE estimation, a systematic framework for evaluating generalizability remains underdeveloped. For example, \citet{johansson2018learning} validate their model using both simulated and real-world data. The simulated data examples assess predictive generalizability with MSE in the absence of any treatment mechanism, making causal verification impossible. Additionally, their analysis of the IHDP dataset \citep{hill2011bayesian} does not involve covariate or treatment shifts, so it does not effectively test generalizability. Another relevant paper is \citet{shi2021invariant}, which measures out-of-domain generalization performance using the mean absolute error (MAE). While their method achieves the lowest MAE among competitors, there is no formal criterion to determine whether a specific MAE value signifies sufficient generalization to a new domain.

We highlight these issues not as criticisms of the papers, but to emphasize that robust methods for evaluating the generalizability of causal models are lacking and are challenging to develop. Furthermore, existing benchmarks like IHDP are not specifically designed for out-of-domain generalization tests. To address this gap, we propose a systematic semi-synthetic framework to evaluate how well CATE algorithms perform across domains with different covariate distributions, offering a more practical assessment of whether a given approach will generalize well. In \Cref{sec:IHDP}, we adapt the IHDP experiments presented in \citet{johansson2018learning} and extend them by generating datasets from different domains, while making the marginal quantity explicitly known. Furthermore, we contrast the predictive MSE scores with the p-values derived from our tests to show how the latter provide a more actionable metric for whether a model successfully generalizes or not.
% We demonstrate its implementation on the IHDP data in \Cref{sec:IHDP}.

\subsection{Frugal Parameterization}\label{subsec:frugal-params}
A frugal parameterization of an observational joint distribution, $P_{\ZbXY}$, factorizes the distribution into a set of causally relevant components~\citep{evans2024parameterizing}. This decomposition explicitly parameterizes the marginal causal distribution, $P_{\Yx}$, or another lower-dimensional causal distribution $P_{\Yx|\bm W}, \bm W\subset \bm Z$, and builds the rest of the model around it. Frugal models require that the three usual assumptions for causal inference (consistency, positivity, no unmeasured confounding) hold, together with some additional regularity assumptions (further details can be found in Appendix A of \citet{evans2024parameterizing}).

We start by parameterizing the \textit{conditional outcome distribution} (COD), $P_{\YxIZb}$. Frugal models can parameterize the COD in terms of the marginal causal distribution, $P_{\Yx}$, and a conditional copula distribution, $C_{\YxIZb}$. Here, $C_{\YxIZb}$ models the joint dependency between the marginal causal distribution and each of the univariate marginal covariate distributions, $\{P_{Z_i}\}_{i}$, such that
\begin{equation*}
    p_{\YxIZb} = p_{\Yx} \cdot  c_{\YxIZb},
\end{equation*}
where $c_{\YxIZb}$ is a copula density function that parameterizes the dependence between $\Yx$ and the covariates. Multivariate copulas, particularly when parameterized using pair copula constructions or vine copulas~\citep{czado2022vine}, offer a rich and flexible framework for modeling complex multivariate distributions, whilst also capturing (or allowing the user to encode) specific dependency constraints in the target data generating process. See \Cref{app:copulas} for further details on copulas and how they can be fitted to real-world datasets.
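
As an illustrative special case (a sketch assuming a single continuous covariate $Z$ with CDF $F_Z$ and a Gaussian copula with correlation $\rho$; neither choice is required by the framework), the factorization reads
\begin{equation*}
    p_{\Yx\mid Z}(y \mid z) = p_{\Yx}(y)\, c\bigl(F_{\Yx}(y), F_{Z}(z); \rho\bigr), \qquad
    c(u,v;\rho) = \frac{1}{\sqrt{1-\rho^2}} \exp\left\{-\frac{\rho^2(a^2+b^2) - 2\rho a b}{2(1-\rho^2)}\right\},
\end{equation*}
with $a = \Phi^{-1}(u)$, $b = \Phi^{-1}(v)$, and $\Phi$ the standard normal CDF. Because a copula density has uniform margins, $\int_0^1 c(u,v;\rho)\,dv = 1$, and substituting $v = F_Z(z)$ gives
\begin{equation*}
    \int p_{\Yx\mid Z}(y \mid z)\, p_{Z}(z)\, dz = p_{\Yx}(y) \int_0^1 c\bigl(F_{\Yx}(y), v; \rho\bigr)\, dv = p_{\Yx}(y),
\end{equation*}
so the marginal causal density is recovered for any value of $\rho$.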
% \begin{equation}
%     C_{\YxIZb} := C\left(F_{\Yx} \mid F_{Z_1},\dots,~F_{Z_{D}} \right).
% \end{equation}
% We present a summary of copulas in the appendix for unfamiliar readers, but in short, copulas provide a framework for encoding dependencies between marginal quantities in such a way that the marginal distributions are preserved.

This leaves the distribution of the \textit{past}, %$P_{\bm{Z}X}$, 
i.e.~the covariate distribution and the propensity score. We assume that all covariates are strictly pretreatment, so $\bm{Z}$ does not include any mediators of the causal effect of $X$ on $Y$. If we use a conditional copula then the past and the COD are variation independent, in the sense that they parameterize separate, non-overlapping aspects of the joint distribution.
% ~\citep{evans2024parameterizing}. 
This allows the past to be freely specified without affecting either the conditional copula or the marginal causal distribution.
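
Putting the pieces together (a sketch of the resulting factorization, which uses consistency and conditional ignorability to identify $p_{Y\mid\bm{Z},X}(y \mid \bm{z}, x)$ with $p_{\Yx\mid\bm{Z}}(y \mid \bm{z})$), the observational joint density can be written as
\begin{equation*}
    p_{\bm{Z}XY}(\bm{z},x,y) =
    \underbrace{p_{\bm{Z}}(\bm{z})\, p_{X\mid\bm{Z}}(x \mid \bm{z})}_{\text{past}}
    \cdot \underbrace{p_{\Yx}(y)}_{\text{causal margin}}
    \cdot \underbrace{c_{\Yx\mid\bm{Z}}\bigl(F_{\Yx}(y) \mid F_{Z_1}(z_1),\dots,F_{Z_D}(z_D)\bigr)}_{\text{dependence}},
\end{equation*}
where $F_{\Yx}$ and $F_{Z_i}$ denote the corresponding marginal CDFs. The labels under the braces are informal and introduced here only for illustration.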

The frugal parameterization also allows us to choose a conditional estimand. For example, if we were interested in a conditional average treatment effect given $\bm{W} \subset \bm{Z}$, we could write $p_{\Yx|\bm{Z}} = p_{\Yx|\bm{W}} \cdot c_{\Yx|\overline{\bm{Z}}; \bm{W}}$ where $\overline{\bm{Z}} = \bm{Z}\setminus\bm{W}$. Here $c_{\Yx|\overline{\bm{Z}}; \bm{W}}$ is a pair-copula between $\Yx$ and $\overline{\bm{Z}}$ conditional upon $\bm{W}$. This enables us to condition on a small subset of covariates that we consider to be particularly important for predicting the outcome.