
In this paper, we develop a statistical method for evaluating the generalizability of causal inference algorithms using actual application data, facilitated by the frugal parameterization. Our approach introduces a semi-synthetic simulation framework that bridges the gap between synthetic simulations and real-world applications, and it supports the generalizability evaluation of both mean and distributional regression models. With flexible, user-defined data generation processes, our framework provides a principled, binary decision about whether a model generalizes to a specific domain. Such a decision is essential for model selection. In practice, our method structures the selection process into two stages:
\begin{itemize}
    \item Stage 1: Apply the proposed testing procedure to identify models that generalize across domains.
    \item Stage 2: Among the models that pass the test, use a metric like MSE to choose the best-performing one.
\end{itemize}
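As a sketch, the two stages can be written compactly in notation that we introduce here only for illustration: let $\mathcal{M}$ denote the set of candidate models, $p_m$ the $p$-value returned by the testing procedure for model $m$, $\alpha$ a user-chosen significance level, and $\mathrm{MSE}(m)$ the error of model $m$ on the evaluation data. Assuming that a model passes the test when the null hypothesis of generalizability is not rejected at level $\alpha$, the selected model is
\[
  \widehat{m} \in \arg\min_{m \in \mathcal{M}_{\alpha}} \mathrm{MSE}(m),
  \qquad
  \mathcal{M}_{\alpha} = \{\, m \in \mathcal{M} : p_m > \alpha \,\}.
\]
If $\mathcal{M}_{\alpha}$ is empty, no candidate passes Stage 1 and Stage 2 is not applied.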


This two-stage approach makes model selection both statistically sound and practically robust, because it assesses generalizability before comparing performance. Following this framework, we select models that are ``good and generalizable'', rather than models that are merely ``relatively good'' according to MSE alone, with no assessment of generalizability. We provide a more detailed comparison between our method and MSE in \Cref{sec:compare_with_MSE}.

Through experiments on synthetic and IHDP datasets, we assess the generalizability of algorithms such as TARNet, CausalForest, S-/T-BART, and S-/T-engression under domain shift. Our method acts as a diagnostic tool that allows us to explore how factors such as training set size or covariate shift affect generalizability. These insights can help identify model strengths and weaknesses and clarify how causal inference models adapt to different settings.

% In \Cref{sec:experiments}, we present both fully simulated examples and semi-synthetic experiments based on the IHDP dataset, which is commonly used to validate generalization in causal inference tasks. We experimented with only Gaussian copulas with a fully connected dependency structure on a relatively small number of covariates. However, our framework can be extended to high-dimensional covariate settings and more complex dependency structures. For example, pair-copula constructions allow for flexible modeling of non-Gaussian copulas with complex dependency structures. Further details and experiments for each of these cases can be found in \Cref{app:vinecop} and \Cref{sec:complicated_exp}. Although our approach is mainly designed for evaluation, we provide additional experiments addressing the capability of our method to handle model misspecification when generating semi-synthetic data (see the end of \Cref{sec:complicated_exp}).

% While rejecting the null hypothesis in our approach shows that a model is not generalizable, it does not quantify the extent of failure. An extension of this approach may be to develop a more flexible testing method, inspired by equivalence testing \citep{wellek2002testing}. This would assess not just whether a model fails but also by how much, determining if its performance is significantly worse than a given threshold, offering a more nuanced view than traditional hypothesis testing. We provide some results in \Cref{sec:equiv_testing}. In this paper, we only consider marginal causal quantities as the validation references, but our framework can be easily adapted to use lower-dimensional CODs as the reference instead, thanks to the flexibility of the frugal parameterization (see \Cref{subsec:frugal-params}).

% We would also like to emphasize the objective of our method, which is to test the quality of fit of a conditional quantity against a lower-dimensional marginal target. By introducing a low-dimensional proxy that would be identifiable under the true model, we aim to provide a quantity that is more tractable for testing, even if it sacrifices identification of a unique, correct CATE. While different CATEs can lead to the same marginal outcomes, we argue that this degeneracy is not a critical limitation in our setting. A lack of rejection simply indicates insufficient evidence that the model fails to generalize, instead of guaranteeing correctness of the CATE. We recognize that a model could fit an incorrect CATE while still producing accurate marginal outcomes. However, our empirical results suggest that such cases are rare in practice. Here, we appeal to a general result of the following form: the set of distributions where the COD fails to generalize but the marginal estimand does is a measure-zero subset of distributions where the COD fails. This is analogous to the so-called `faithfulness' argument for causal discovery algorithms \citep{spirtes2000causation}, or the `completeness' of d-separation. In finite samples we would need a stronger assumption (more analogous to `strong faithfulness' in \citet{zhang2002strong}) to avoid such false negatives. This is beyond the scope of our paper.

We hope that this work inspires a more careful consideration of model evaluation, encourages simulations that better reflect real-world conditions, and highlights the importance of stress testing in advancing causal inference methodologies.