Model generalizability has garnered significant interest in causal inference \citep{bareinboim2016causal,curth2021really, johansson2018learning, buchanan2018generalizing,ling2022critical,bica2022transfer}. This encompasses transportability under covariate shifts between domains and extrapolation. In causal inference, it specifically refers to the ability of a causal model to make accurate predictions or draw valid conclusions when applied to a domain different from the one it was trained on. This concept is crucial when the objective involves understanding and predicting the effects of interventions across various settings. It holds particular importance in clinical contexts, where the 
interest in personalized treatment and patient stratification underscores the need to generalize inferences across diverse populations.

Current approaches for evaluating model generalizability generally involve using predictive metrics like AUC for classification or mean squared error for regression \citep{zhou2022domain,yu2024survey}. 
% However, these metrics do not directly answer the question of interest, that is, \emph{whether a model can or cannot generalize}.  
However, these metrics do not directly assess the evidence for whether a model generalizes across domains, nor do they provide error-controlled decision thresholds.
Does an MSE of 5 on another domain imply that the model does not generalize? How about an MSE of 1? Are these results and interpretations reproducible with statistical guarantees? How much does random noise affect these metrics? These are critical problems that should be carefully considered in causal inference questions involving multiple domains. It is essential to establish a systematic evaluation framework for generalizability performance, which offers a robust, reproducible evaluation of model performance on relevant tasks.

One approach to this problem is statistical testing, where we frame the question of interest as the hypothesis to be tested. However, it is difficult to obtain power against a wide range of alternative hypotheses when performing tests conditional on a high-dimensional covariate set. This is a problem for causal practitioners, as they are often interested in modeling quantities such as the individual treatment effect.

% For generalizability in causal inference, the critical task is to determine, with controlled probability of making errors due to random noise, whether models can generalize causal insights across (task-specific) populations.
    
\paragraph{Main Contributions} We propose a systematic framework for statistically evaluating the generalizability of high-dimensional causal inference algorithms by targeting low-dimensional causal margins.
% Rather than relying on arbitrary metrics such as MSE, we provide a testing framework that statistically evaluates the transportability of both mean and distributional regression methods. 
Complementing existing predictive metrics such as MSE, we provide a testing framework that statistically evaluates the transportability of both mean and distributional regression methods. 

% Our method includes a semi-synthetic simulation framework using two domains---training (A) and testing (B)---that share the same intervened conditional outcome distributions, but potentially differ in their covariate and treatment distributions.  A model is trained on domain $A$ to \textbf{learn the shared high dimensional conditional outcome distribution}. We test the model's generalizability by estimating the marginal causal quantities in domain $B$, where these values are \emph{explicitly known}. This is made possible through the use of the frugal parameterization \citep{evans2024parameterizing}. Our approach simplifies the evaluation process by reducing the complexity from higher-dimensional intervened models to a lower-dimensional causal effect, enabling more powerful statistical testing.

Our method includes a semi-synthetic simulation framework using two domains, training ($A$) and testing ($B$), which have different covariate ($\bm{Z}$) and treatment ($X$) distributions, but whose \emph{conditional outcome distribution} (COD, $\Yx \mid \bm{Z}$) is assumed to be the same. First, we fit a frugally parameterized model \citep{evans2024parameterizing} to learn the COD $P_{\Yx|\bm{Z}}$ on domain $B$. The frugal parameterization allows us to obtain the \emph{marginal outcome distribution} (MOD) of $\Yx$ on domain $B$ % marginal causal density within the full joint distribution 
explicitly as part of the joint. 
% Next, we specify covariates and treatment distributions in domain $A$ to be different from domain $B$. 
We then generate semi-synthetic outcome samples of domain $A$ by applying the COD of domain $B$, while using the covariates and treatments from domain $A$. 

Next, we train the causal model of interest on these semi-synthetic samples in domain $A$, and use it to estimate marginal causal quantities for domain $B$. The model's generalizability is assessed by statistically testing its ability to recover marginal causal quantities from domain $B$ against the \emph{explicitly known} ground truth inferred earlier. By reducing the target of evaluation from a higher-dimensional conditional outcome model to a lower-dimensional marginal causal effect, we simplify the evaluation process, enabling more powerful statistical testing.
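
As a rough sketch of this workflow in symbols, consider the case where the marginal quantity of interest is the mean of $\Yx$ in domain $B$. The notation used here ($\tilde{y}^{A}_{i}$, $\hat{f}$, $\mu_{B}$, $\hat{\mu}_{B}$, $n_{B}$) is introduced only for illustration and should be read as an assumption rather than as part of the formal development. Semi-synthetic outcomes in domain $A$ are drawn from the COD learned on domain $B$, evaluated at the covariates and treatments of domain $A$:
\[
  \tilde{y}^{A}_{i} \sim P_{\Yx \mid \bm{Z}}\bigl(\,\cdot \mid \bm{Z} = \bm{z}^{A}_{i}\bigr)
  \quad \mbox{with } x = x^{A}_{i}.
\]
A model $\hat{f}$ trained on $\{(\bm{z}^{A}_{i}, x^{A}_{i}, \tilde{y}^{A}_{i})\}_{i}$ is then applied to the $n_{B}$ covariate samples from domain $B$ and averaged to give a marginal estimate, which is compared with the value known from the frugal parameterization:
\[
  \hat{\mu}_{B}(x) = \frac{1}{n_{B}} \sum_{i=1}^{n_{B}} \hat{f}\bigl(\bm{z}^{B}_{i}, x\bigr),
  \qquad
  \mu_{B}(x) = \mathrm{E}_{B}\bigl[\Yx\bigr].
\]
Under the shared COD, a model that has learned the conditional mean correctly would satisfy, by iterated expectations over the covariate distribution of domain $B$,
\[
  H_{0} : \; \mathrm{E}_{B}\bigl[\hat{f}(\bm{Z}, x)\bigr] = \mu_{B}(x),
\]
so rejecting $H_{0}$ is evidence that the model fails to generalize to domain $B$. For distributional regression methods, the same idea would apply with the mean replaced by the marginal distribution of $\Yx$.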


% With our method, we are able to derive the explicit, known values of these marginal quantities. We thus assess the generalizability of the trained model by constructing estimations and true marginal quantities. This approach simplifies the evaluation from a higher-dimensional intervened model to lower-dimensional marginal quantities, facilitating statistical testing.

% \paragraph{DAN's DRAFT EDITS} 
% A high-level overview of the workflow is as follows. We first define two domains $A$ and $B$ to be our training and test domains, respectively. A frugal model is fit to data from $B$ to learn the Conditional Outcome Distribution (COD), $p^{B}_{Y(x)|\bm{Z}}$. The choice of a frugal model allows us to explicitly parameterize the marginal causal density in domain $B$. Next, we generate semi-synthetic outcomes for domain $A$ corresponding to the COD defined in $B$: $\hat{y}_{i}^{A} \sim p^{B}_{Y(x)|\bm{Z}}(\bm{z}_{i}^{A},~x_{i}^{A})$. A model $\hat{f}$ is then fit on the semi-synthetic data from domain $A$, $\{(\bm{z}^{A},x^{A},\hat{y}^{B})\}_{i}$, and then sued to estimate outcomes on data from $B$, $\hat{y}^{B}_{i} = \hat{f}(\bm{z}^{B}_{i}, x^{B}_{i})$. These are then used to estimate marginal quantities, which are then statistically tested against the underlying truth modeled by $p^{B}_{Y(x)|\bm{Z}}$.

% A high-level overview of the workflow is as follows:
% \begin{enumerate}
%     \item \textbf{Learn both the distribution parameters of two domains, and the Conditional Outcome Distribution (COD) from real-world data}: Define two domains, domain $A$ and domain $B$, of which the covariate and treatment distributions can be different, but the COD is the same. These distributions can be learned empirically from real-world data, rather than just being limited to specifying parametric models.
    
%     \item \textbf{Model training}: Simulate semi-synthetic data from domain $A$ using the distributions fitted on data in step 1. Train a conditional effect model on the simulated data.
    
%     \item \textbf{Prediction/Estimaton}: Simulate data from domain $B$. Apply the trained model on the sampled covariates and treatments from domain $B$ and estimate marginal causal quantities outcome predictions from the model.
    
%     \item \textbf{Evaluate generalizability with statistical testing}: Statistically test whether the sampled outcomes deviate significantly from the known ground truth in domain $B$. This provides an evaluation of the model's generalizability under covariate and treatment distribution shifts. The tests assess whether the model generalizes effectively by focusing on lower-dimensional quantities instead of high-dimensional conditional outcome models.
% \end{enumerate}

The availability of exact marginal quantities in domain $B$ enables us to construct our proposed workflow.
% The proposed method builds on the availability of marginal causal quantities in domain $B$. 
In real applications, it is often the marginal quantities that are reported. For example, many studies analyzing COVID-19 outcomes report untreated outcomes, such as mortality rates or symptom progression, to contextualize treatment effects. The untreated mortality rate for severe COVID-19 in \cite{recovery2021dexamethasone} is often cited as a benchmark for evaluating interventions like dexamethasone. Our method thus provides a simple and effective solution for assessing the generalizability of an algorithm on complex real-world data with statistical guarantees, including Type-I error control.
% A high-level overview of the workflow of our method is as follows. (1) We first define two domains, domain $A$ (training) and domain $B$ (testing), of which the covariate and treatment distributions can be different, but the COD is the same. These distributions can be learned empirically from real-world data, rather than just limited to specifying parametric models. (2) Next, we simuluate semi-synthetic data from domain $A$ using pre-specified distributions. Train a conditional effect model on the simulated data. (3) We then simulate data from domain $B$, whose covariate and treatment distributions may differ from domain $A$, but with an identical COD. Apply the trained model on the sampled covariates and treatments from domain $B$ and estimate marginal causal quantities outcome predictions from the model. (4) Finally, we statistically test whether the estimated marginal causal quantities deviate significantly from the known ground truth in domain $B$. This provides an evaluation of the model's generalizability under covariate and treatment distribution shifts. The tests assess whether the model generalizes effectively by focusing on lower-dimensional quantities (marginal causal distributions) instead of high-dimensional conditional outcome models.

The code used for this paper can be found in \href{https://github.com/rje42/DomainChange}{\texttt{https://github.com/rje42/DomainChange}}.

% \paragraph{Main Contributions} In this work, we propose a formal framework for statistically testing the generalizability of machine learning algorithms under covariate and treatment distribution shifts, specifically in the context of causal inference. Rather than simply relying on predictive metric scores, we provide tests that statistically evaluate the ability of both mean and distributional regression methods regarding generalizability. 
% % Our approach is built on \textbf{frugal parameterization}\cite{evans2024parameterizing}, enabling simulations from various data-generating processes as well as real-world data.  
% % In real applications, generalizability is particularly dependent on key properties from real data. For instance, sample size may affect performances of algorithms like neural networks to play the balance between in-sample performance and generalizability. Complex data structures can also play a crucial rule. This is why our simulation-based method is so important: it offers a comprehensive approach to evaluate model generalizability across diverse scenarios, providing a simple and effective solution to account for these complexities in real-world applications. 
% % In real-world applications, generalizability depends on factors such as sample sizes and the complexity of the data structures. Our proposed simulation-based method offers a comprehensive framework for quantitatively evaluating model generalizability across diverse scenarios. 
% This provides a simple and effective solution for assessing how well algorithms account for these complexities in real-world applications. 

% Consequently, we claim that our evaluation method is:
% \begin{itemize}
%     \item \textbf{\textit{Systematic}}: We offer a structured framework that allows users to easily specify and input flexible data generation processes for simulations from various data generation processes.
%     \item \textbf{\textit{Robust}}: We incorporate statistical testing to evaluate the generalizability of distributional and mean regression models, evaluating model generalizability by directly and providing statistical safeguard for decision making, which proxy predictive measures like MSE fail to do.
%     \item \textbf{\textit{Realistic}}: Simulations can be based on actual data, bridging the gap between synthetic evaluations and real-world applications.
% \end{itemize}
% Consequently, we claim that our evaluation method is \textbf{\textit{systematic}} - we offer a structured framework that allows users to easily specify and input flexible data generation processes for simulations, \textbf{\textit{comprehensive}} - our method supports simulations from various data generation processes, covering both continuous and discrete covariates and outcomes,
% % , with distributions like Gamma, Exponential, Gaussian, etc), 
% \textbf{\textit{robust}} - we incorporate statistical testing to evaluate the generalizability of distributional and mean regression models, and \textbf{\textit{realistic}} - simulations can be based on actual data, bridging the gap between synthetic evaluations and real-world applications.

