
% In this paper, we develop a statistical method for evaluating the generalizability of causal inference algorithms using actual application data, facilitated by the frugal parameterization. Our approach introduces a semi-synthetic simulation framework that bridges the gap between synthetic simulations and real-world applications, supporting the generalizability evaluation of both mean and distributional regression models. With flexible, user-defined data generation processes, our framework provides a principled, binary decision about whether or not a model is generalizable to a specific domain. This is essential for model selection. In practice, our method helps structure the selection process into two stages:
% \begin{itemize}
%     \item Stage 1: Apply the proposed testing procedure to identify models that generalize across domains.
%     \item Stage 2: Among the models that pass the test, use a metric like MSE to choose the best-performing one.
% \end{itemize}
We conclude with several remarks on extensions and limitations of our framework.

\paragraph{Flexibility of vine copula specification} In \Cref{sec:experiments}, we present both fully simulated examples and semi-synthetic experiments based on the IHDP dataset, which is commonly used to validate generalization in causal inference tasks. We experimented only with Gaussian copulas with a fully connected dependency structure on a relatively small number of covariates. However, our framework can be extended to high-dimensional covariate settings and more complex dependency structures. For example, pair-copula constructions allow for flexible modeling of non-Gaussian copulas with complex dependency structures. Additionally, one may consider models such as Frugal Flows \citep{de2024marginal}, which fit a more flexible, non-parametric generative frugal model to real-world data. Further details and experiments for each of these cases can be found in \Cref{app:vinecop} and \Cref{sec:complicated_exp}. Although our approach is mainly designed for evaluation, we provide additional experiments that examine how well our method handles model misspecification when generating semi-synthetic data (see the end of \Cref{sec:complicated_exp}).
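
To illustrate what a pair-copula construction looks like, consider a minimal three-covariate sketch. The notation here ($f_j$, $F_j$, $u_j$, $c_{jk}$) is introduced only for this illustration and is not the notation of \Cref{app:vinecop}. Writing $u_j = F_j(x_j)$ and $u_{j \mid 2} = F_{j \mid 2}(x_j \mid x_2)$, one possible D-vine decomposition of a joint density is
\[
f(x_1, x_2, x_3) = \Bigl[\prod_{j=1}^{3} f_j(x_j)\Bigr] \, c_{12}(u_1, u_2) \, c_{23}(u_2, u_3) \, c_{13 \mid 2}(u_{1 \mid 2}, u_{3 \mid 2}),
\]
where we assume, as is common in the vine copula literature, that the conditional pair copula $c_{13 \mid 2}$ does not depend on the value of $x_2$ other than through its arguments. Each bivariate copula can be drawn from a different parametric family, which is the source of the flexibility beyond the Gaussian case.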

% This two-stage approach ensures that model selection is both statistically sound and practically robust, as it prioritizes generalizability before performance evaluation. Following this framework, we select models that are ``good and generalizable'', rather than just ``relatively good'' without generalizability assessment via MSE alone. We provide more details of the comparison between our method and MSE in \Cref{sec:compare_with_MSE}.


\paragraph{Equivalence testing} While rejecting the null hypothesis in our approach shows that a model is not generalizable, it does not quantify the extent of the failure. One possible extension is a more flexible testing method inspired by equivalence testing \citep{wellek2002testing}. This would assess not just whether a model fails but also by how much, by determining whether its performance is significantly worse than a given threshold, which offers a more nuanced view than traditional hypothesis testing. We provide some results in \Cref{sec:equiv_testing}. In this paper, we only consider marginal causal quantities as the validation references, but thanks to the flexibility of the frugal parameterization, our framework can be easily adapted to use low-dimensional CODs as the reference instead (see \Cref{subsec:frugal-params}).
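
As a rough sketch of such a threshold-based test, suppose $\tau_0$ denotes the known marginal causal quantity in the semi-synthetic data, $\tau_f$ the corresponding quantity implied by the fitted model, and $\delta \geq 0$ a user-chosen tolerance. These symbols are assumptions made for this illustration and need not match the formulation in \Cref{sec:equiv_testing}. The relaxed hypotheses could then be written as
\[
H_0^{\delta} : \lvert \tau_f - \tau_0 \rvert \leq \delta
\qquad \text{versus} \qquad
H_1^{\delta} : \lvert \tau_f - \tau_0 \rvert > \delta .
\]
Setting $\delta = 0$ recovers a point null of exact agreement, while $\delta > 0$ only flags models whose discrepancy significantly exceeds the tolerance. A classical equivalence test in the sense of \citet{wellek2002testing} swaps the roles of the two hypotheses, so that rejection provides positive evidence that the discrepancy lies within $\delta$.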

\paragraph{Validity of using low-dimensional proxy} We would also like to emphasize the objective of our method, which is to test the quality of fit of a conditional quantity against a lower-dimensional marginal target. By introducing a low-dimensional proxy that would be identifiable under the true model, we aim to provide a quantity that is more tractable for testing, even if it sacrifices identification of a unique, correct CATE. While different CATEs can lead to the same marginal outcomes, we argue that this degeneracy is not a critical limitation in our setting. A lack of rejection simply indicates insufficient evidence that the model fails to generalize; it does not guarantee that the CATE is correct. We recognize that a model could fit an incorrect CATE while still producing accurate marginal outcomes. However, our empirical results suggest that such cases are rare in practice. Here, we appeal to a general result of the following form: the set of distributions where the COD fails to generalize but the marginal estimand does not is a measure-zero subset of the distributions where the COD fails. This is analogous to the so-called `faithfulness' argument for causal discovery algorithms \citep{spirtes2000causation}, or the `completeness' of d-separation. In finite samples we would need a stronger assumption (more analogous to `strong faithfulness' in \citealp{zhang2002strong}) to avoid such false negatives. This is beyond the scope of our paper.
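
One way to write the measure-zero statement above more explicitly, using notation that is purely illustrative, is as follows. Let $\mathcal{F}_{\mathrm{COD}}$ be the set of distributions in the model class for which the COD fails to generalize, let $\mathcal{F}_{\mathrm{marg}}$ be the set for which the marginal estimand fails to generalize, and let $\mu$ be a suitable measure on $\mathcal{F}_{\mathrm{COD}}$ (for instance, one induced by Lebesgue measure on a finite-dimensional parameterization, which is an assumption on our part). The result we appeal to would then take the form
\[
\mu\bigl(\mathcal{F}_{\mathrm{COD}} \setminus \mathcal{F}_{\mathrm{marg}}\bigr) = 0 ,
\]
that is, failures of the COD that leave the marginal estimand intact require an exact cancellation that almost no distribution in $\mathcal{F}_{\mathrm{COD}}$ exhibits.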

% We hope that this work inspires a more careful consideration of model evaluation, encourages simulations that better reflect real-world conditions, and highlights the importance of stress testing in advancing causal inference methodologies.