\section{COPULA BACKGROUND}\label{app:copulas}
% \vspace{-1cm}

Copulas provide a powerful tool to model joint dependencies, independent of the univariate margins. This aligns well with the requirements of the frugal parameterization, where dependencies need to be varied without altering specified margins (the most critical being the specified causal effect). Understanding the constraints and limitations of copula models ensures that causal models remain accurate and consistent with the intended parameterization.

\subsection{SKLAR'S THEOREM}
Sklar's theorem \citep{sklar1959,czado2019analyzing} is the foundation of copula modelling: it bridges multivariate joint distributions and their univariate margins. It allows one to separate the marginal behaviour of each variable from their joint dependence structure, with the latter being captured by the copula.

\begin{theorem}
For a $d$-variate distribution function $F_{1:d} \in \mathcal{F}(F_1,\ldots,F_d)$, with $j$th univariate margin $F_j$, the copula associated with $F_{1:d}$ is a distribution function $C : [0,1]^d \rightarrow[0,1]$ with uniform margins on $(0,1)$ that satisfies
\begin{equation*}
    F_{1:d}(\bm{y}) = C(F_1(y_1),\dots,F_{d}(y_d)), \qquad \bm{y} \in \mathbb{R}^{d}.
\end{equation*}
\begin{enumerate}
    \item If $F_{1:d}$ is a continuous $d$-variate distribution function with univariate margins $F_1,\dots, F_d$ and quantile functions $F^{-1}_1,\dots, F^{-1}_d$, then
    \begin{equation*}
        C(\bm{u}) = F_{1:d}(F^{-1}_1(u_1),\dots,F^{-1}_d(u_d)), \qquad \bm{u}\in[0,1]^d.
    \end{equation*}
    \item If $F_{1:d}$ is a $d$-variate distribution function of discrete random variables (more generally, partly continuous and partly discrete), then the copula is unique only on the set
    \begin{equation*}
        \operatorname{Range}(F_1) \times \dots \times \operatorname{Range}(F_d).
    \end{equation*}
\end{enumerate}
When $F_{1:d}$ is absolutely continuous, the copula has a density $c(\cdot)$ that satisfies
\begin{equation*}
    f(\bm{y}) = c(F_1(y_1),\dots, F_d(y_d))\cdot f_1(y_1)\cdots f_d(y_d),
\end{equation*}
where $f_j(\cdot)$ is the univariate density function of $Y_j$.
\end{theorem}

Note that Sklar's theorem explicitly refers to the \textbf{univariate margins} of the variable set $\{Y_1,\dots, Y_d\}$ to convert between the copula $C(\bm{u})$, which is the joint distribution of the uniformly transformed margins, and the original distribution $F_{1:d}(\bm{y})$. For continuous random variables, the copula function $C$ is unique. This uniqueness no longer holds for discrete variables, but this does not severely limit the applicability of copulas to simulating from discrete distributions.
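As a minimal numerical sketch of Sklar's theorem (not the implementation used in the paper), the following Python snippet draws from a bivariate Gaussian copula and then imposes margins through their quantile functions. The correlation value, the gamma and normal margins, and the helper name \texttt{sample\_gaussian\_copula} are illustrative assumptions.

```python
import numpy as np
from scipy import stats


def sample_gaussian_copula(n, rho, rng):
    """Draw n pairs (u1, u2) from a bivariate Gaussian copula with correlation rho."""
    cov = np.array([[1.0, rho], [rho, 1.0]])
    z = rng.multivariate_normal(mean=np.zeros(2), cov=cov, size=n)
    # Probability integral transform: each column becomes Uniform(0, 1).
    return stats.norm.cdf(z)


rng = np.random.default_rng(0)
u = sample_gaussian_copula(20_000, rho=0.6, rng=rng)

# Sklar: apply each margin's quantile function F_j^{-1} to the uniform coordinates.
y1 = stats.gamma.ppf(u[:, 0], a=2.0, scale=1.0)   # Gamma(shape=2, scale=1) margin
y2 = stats.norm.ppf(u[:, 1], loc=1.0, scale=1.0)  # N(1, 1) margin

# Margins should match their targets; rank dependence is inherited from the copula.
ks_gamma = stats.kstest(y1, "gamma", args=(2.0,)).statistic
ks_norm = stats.kstest(y2, "norm", args=(1.0, 1.0)).statistic
tau_u = stats.kendalltau(u[:, 0], u[:, 1])[0]
tau_y = stats.kendalltau(y1, y2)[0]
```

Because the quantile functions are strictly increasing, Kendall's $\tau$ computed on $(y_1, y_2)$ agrees with that computed on $(u_1, u_2)$: changing the margins leaves the rank dependence untouched.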

Equivalently, from an analytical viewpoint, a function $C: [0, 1]^d \rightarrow [0, 1]$ is a $d$-dimensional copula if it has the following properties:
\begin{enumerate}
    \item $C(u_1,\dots, 0, \dots, u_d) = 0$;
    \item $C(1, \dots, 1, u_i, 1, \dots, 1) = u_i$;
    \item $C$ is $d$-non-decreasing.
\end{enumerate}
\begin{definition}
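    A function $C$ is $d$-non-decreasing if, for any hyper-rectangle $H=\prod_{i=1}^{d}\left[a_i, b_i \right]\subseteq [0,1]^{d}$, the $C$-volume of $H$ is non-negative:
    \begin{equation*}
        V_C(H) = \sum_{\bm{v} \in \prod_{i=1}^{d}\{a_i, b_i\}} (-1)^{N(\bm{v})}\, C(\bm{v}) \geq 0,
    \end{equation*}
    where the sum runs over the $2^d$ vertices of $H$ and $N(\bm{v})$ is the number of coordinates $i$ with $v_i = a_i$. When $C$ has a density $c$, this is equivalent to $\int_{H} c(\bm{u})~d\bm{u} \geq 0$.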
\end{definition}
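To make the $d$-non-decreasing property concrete, the sketch below evaluates the $C$-volume of a hyper-rectangle as the signed sum of $C$ over its $2^d$ vertices (the standard inclusion-exclusion form). The independence copula and the Fr\'echet lower bound are illustrative choices, and the function names are hypothetical.

```python
import itertools
import numpy as np


def independence_copula(u):
    """C(u) = u_1 * ... * u_d, the copula of independent variables."""
    return float(np.prod(u))


def frechet_lower_bound(u):
    """W(u) = max(u_1 + ... + u_d - d + 1, 0); not a copula for d >= 3."""
    return max(float(np.sum(u)) - len(u) + 1.0, 0.0)


def c_volume(func, lower, upper):
    """Signed sum of func over the 2^d vertices of prod_i [lower_i, upper_i]."""
    d = len(lower)
    total = 0.0
    for picks in itertools.product((0, 1), repeat=d):
        vertex = [upper[i] if picks[i] else lower[i] for i in range(d)]
        n_lower = d - sum(picks)  # number of coordinates at the lower endpoint
        total += (-1) ** n_lower * func(vertex)
    return total


# Independence copula: the volume is the product of side lengths, 0.4 * 0.3 * 0.4.
vol_indep = c_volume(independence_copula, [0.2, 0.1, 0.5], [0.6, 0.4, 0.9])

# Frechet lower bound in d = 3: the box [0.5, 1]^3 receives negative "mass".
vol_w = c_volume(frechet_lower_bound, [0.5, 0.5, 0.5], [1.0, 1.0, 1.0])
```

The second call returns a negative value, which illustrates why the boundary conditions alone do not suffice: a function can satisfy them and still fail to be a copula.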
%%%%%%%%%%%%%%%%%% NEED TO KEEP THIS IN ORDER FOR SUPP MATERIAL TO BE RENDERED WELL.

\subsection{COPULAS FOR DISCRETE VARIABLES}\label{appsub:discrete-copulas}

% \subsubsection{EMPIRICAL COPULA PROCESSES FOR DISCRETE VARIABLES}
% \subsubsection{CHALLENGES AND MOTIVATIONS}
% \label{subsubsec:discrete-copula}
Modelling dependence in discrete and mixed data is particularly challenging, as copulas for discrete variables are not unique. Additionally, copulas encode an ordering of each variable's values, and hence should only be used for count or ordinal data, not for unordered categorical data.
% We use the approach suggested by \citet{ruschendorf2009distributional}.
 %\label{appsub:distribtional-transform}
To deal with discrete variables, we use the generalized distributional transform of a random variable, originally proposed by \citet{ruschendorf2009distributional}.
% We quote the main result from \citet{ruschendorf2009distributional} below. 

\begin{theorem}
On a probability space $(\Omega, \mathcal{A}, P)$ let $X$ be a real random variable with distribution function $F$ and let $V \sim U(0, 1)$ be independent of $X$. For $\lambda \in [0,1]$, the \textit{modified distribution function} $F(x, \lambda)$ is defined by
\begin{equation*}
F(x, \lambda) := P(X < x) + \lambda P(X = x).
\end{equation*}
We define the (generalized) \textit{distributional transform} of $X$ by
\begin{equation*}
U := F(X, V).
\end{equation*}
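An equivalent representation of the distributional transform is
\begin{equation*}
U = F(X-) + V(F(X) - F(X-)).
\end{equation*}
Then $U \sim U(0,1)$ and $X = F^{-1}(U)$ almost surely, where $F^{-1}$ denotes the generalized inverse (quantile function) of $F$.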
\end{theorem}
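The transform can be checked numerically. The sketch below applies it to a Poisson variable, an illustrative choice, using the representation $U = F(X-) + V(F(X) - F(X-))$; the helper name \texttt{distributional\_transform} is hypothetical and the code assumes integer-valued support so that $F(x-) = F(x-1)$.

```python
import numpy as np
from scipy import stats


def distributional_transform(x, dist, rng):
    """U = F(X-) + V * (F(X) - F(X-)) for an integer-valued scipy distribution."""
    v = rng.uniform(size=len(x))
    f_left = dist.cdf(x - 1)   # F(x-) = P(X < x) on integer support
    f_at = dist.cdf(x)         # F(x)  = P(X <= x)
    return f_left + v * (f_at - f_left)


rng = np.random.default_rng(1)
poisson = stats.poisson(mu=3.0)
x = poisson.rvs(size=20_000, random_state=rng)
u = distributional_transform(x, poisson, rng)

# U should be (approximately) Uniform(0, 1) despite X being discrete.
ks_stat = stats.kstest(u, "uniform").statistic

# Applying the quantile function to U recovers the original draws.
x_back = poisson.ppf(u)
```

The randomization $V$ spreads each atom of $X$ uniformly over the corresponding jump of $F$, which is what restores uniformity of $U$.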

\citet{ruschendorf2009distributional} remarks that, for discrete variables, this transform is not unique, since it depends on the auxiliary randomization $V$.

\subsection{PAIR COPULA CONSTRUCTIONS AND VINE COPULAS}\label{app:vinecop}
Pair copula constructions (PCCs) provide a flexible framework for modelling multivariate dependence by decomposing a high-dimensional copula into a sequence of bivariate copulas~\citep{bedford2002}. A vine copula is a specific class of PCCs that employs a graphical model to structure these pairwise dependencies, extending traditional copulas to describe complex dependency structures in high-dimensional data. 

Vine copulas arrange these bivariate copulas in a hierarchical structure, which allows flexible modelling of complex conditional dependence while preserving computational tractability. This flexibility makes them particularly useful for multivariate distributions in which different types of pairwise interaction and conditional dependence must be specified~\citep{czado2022vine,czado2019analyzing}.

A large literature shows how vines can parametrize a wide range of dependency structures through the choice of vine tree structure and of the copula family for each bivariate copula in the vine.

The hierarchical organization of dependencies in vine copulas is achieved through a sequence of trees $\{T_1, T_2, \dots, T_{K}\}$, where $K = d-1$ for a full vine on $d$ variables. Each tree consists of nodes and edges: in $T_1$ the nodes are the variables, and the edges define the unconditional pairwise dependencies between them. Each subsequent tree $T_k$ has the edges of $T_{k-1}$ as its nodes and defines dependencies conditional on the variables those edges share. Each edge in $T_k$ is associated with a bivariate copula that models the (conditional) dependency between two variables. The joint density $c(u_1, \dots, u_d)$ of a vine copula over $d$ marginally uniform random variables can be expressed as:
\begin{equation}
c(u_1, \dots, u_d) = \prod_{k=1}^{d-1} \prod_{(i,j) \in E_k} c_{ij|D_{ij}}(u_i, u_j | u_{D_{ij}}),
\end{equation}
where $E_k$ is the edge set of the $k$th tree, and $D_{ij}$ denotes the conditioning set for the pair $(i, j)$, which is empty for the edges of $T_1$.
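
As a worked instance of this decomposition (an illustration, not an additional result), take $d=3$ with $T_1$ having edges $(1,2)$ and $(2,3)$, and $T_2$ having the single edge $(1,3)$ with conditioning set $D_{13}=\{2\}$. Writing the conditional arguments explicitly, the density becomes
\begin{equation*}
c(u_1, u_2, u_3) = c_{12}(u_1, u_2)\, c_{23}(u_2, u_3)\, c_{13|2}\left(u_{1|2}, u_{3|2}\right),
\end{equation*}
where $u_{1|2} = \partial C_{12}(u_1, u_2)/\partial u_2$ and $u_{3|2} = \partial C_{23}(u_2, u_3)/\partial u_2$ are the conditional distribution functions of $U_1$ and $U_3$ given $U_2 = u_2$. Here we assume, as is common in practice, that $c_{13|2}$ depends on $u_2$ only through these two arguments (the simplifying assumption).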

Vine copulas model complex dependencies by combining bivariate copulas (such as Gaussian, Clayton, Gumbel, or Frank) that capture various types of dependence, including tail dependence and forms of asymmetry. The tree structure determines the order in which dependencies are modelled, and parameters are estimated from empirical data or set by assumption. Their main strength lies in decomposing high-dimensional problems into tractable lower-dimensional components, enabling efficient sampling and inference.
% 
% The flexibility of vine copulas lies in their ability to choose different bivariate families, specify tree structures, and control dependency strengths. Each pair of variables can be modelled using a specific bivariate copula family, such as Gaussian, Clayton, Gumbel, or Frank copulas. These families allow for capturing a variety of dependency types, including tail dependencies and asymmetry. The choice of tree structure determines the order in which dependencies are modelled. The parameters of the bivariate copulas can be adjusted to represent varying levels of correlation or dependency, and these parameters are estimated based on observed data or predefined assumptions.
% 
% The primary advantage of vine copulas is their ability to model complex dependency structures while preserving computational tractability. 
% By decomposing a high-dimensional copula into a cascade of lower-dimensional components, vine copulas facilitate efficient sampling, parameter estimation, and inference. 
In our experimental framework, we leverage these properties to evaluate the impact of different dependency structures on causal inference generalizability.

\subsection{FITTING AND CUSTOMIZING FRUGAL COPULA FITS}\label{subapp:fitting-copulas}
Vine copulas offer a great deal of flexibility for customizing complex variable dependency structures, as well as efficient methods for fitting real-world datasets.

\paragraph{High Dimensional Covariate Fits} For real-world or semi-synthetic data examples, we recommend the use of vine copulas for higher-dimensional and more complex dependency structures. For model selection, a popular choice is the \emph{Dissmann algorithm}, which fits vine copulas iteratively from the lowest tree level upwards~\citep{dissmann2013selecting}; the choice of bivariate copula families can be performed afterwards. We use the implementation in \texttt{rvinecopulib} \citep{rvinecopulib2025}, which performs both structure selection and bivariate copula family selection. For more flexible nonparametric alternatives, we also highlight frugal flows as a viable generative model for learning expressive causal marginals via normalizing flows, although their performance suffers in very high-dimensional settings \citep{de2024marginal}.

\paragraph{Computational Efficiency of Vine Copula Fits} Fitting vine copula models is not computationally prohibitive, even in high dimensional covariate settings. To further aid reproducibility and assess feasibility, \Cref{tab:copula_fitting_times} presents the time taken to fit vine copulas (both structure and bivariate family selection) across different dimensions, using a dataset with $N=200$ and a MacBook Pro M1 Pro, 2023. The results are averaged over 10 different fits.
\begin{table}[h!]
\centering
\begin{tabular}{c r r r r r}
\hline
\multirow{2}{*}{\textbf{Sample Size}} & \multicolumn{5}{c}{$\boldsymbol{D}$}\\
& \multicolumn{1}{c}{$\mathbf{10}$} & \multicolumn{1}{c}{$\mathbf{25}$} & \multicolumn{1}{c}{$\mathbf{50}$} & \multicolumn{1}{c}{$\mathbf{100}$} & \multicolumn{1}{c}{$\mathbf{200}$} \\
\hline
% 10   & 0.13 $\pm$ 0.01 & 0.90 $\pm$ 0.25 & 3.51 $\pm$ 0.04 & 14.4 $\pm$ 0.07 & 61.1 $\pm$ 1.40 \\
10   & 0.13 $\pm$ 0.01 & 0.90 $\pm$ 0.25 & 3.5 $\pm$ 0.04 & 14 $\pm$ 0.07 & 61 $\pm$ 1.40 \\
% 25   & 0.21 $\pm$ 0.01 & 1.40 $\pm$ 0.04 & 5.73 $\pm$ 0.09 & 23.2 $\pm$ 0.07 & 94.4 $\pm$ 0.36 \\
25   & 0.21 $\pm$ 0.01 & 1.40 $\pm$ 0.04 & 5.7 $\pm$ 0.09 & 23 $\pm$ 0.07 & 94 $\pm$ 0.36 \\
% 50   & 0.37 $\pm$ 0.01 & 2.45 $\pm$ 0.05 & 10.0 $\pm$ 0.09 & 40.7 $\pm$ 0.25 & 169 $\pm$ 4.56 \\
50   & 0.37 $\pm$ 0.01 & 2.45 $\pm$ 0.05 & 10.0 $\pm$ 0.09 & 41 $\pm$ 0.25 & 169 $\pm$ 4.56 \\
% 100  & 0.71 $\pm$ 0.07 & 4.66 $\pm$ 0.09 & 19.1 $\pm$ 0.23 & 77.4 $\pm$ 0.46 & 314 $\pm$ 2.54 \\
100  & 0.71 $\pm$ 0.07 & 4.66 $\pm$ 0.09 & 19.1 $\pm$ 0.23 & 77 $\pm$ 0.46 & 314 $\pm$ 2.54 \\
200  & 1.50 $\pm$ 0.10 & 9.62 $\pm$ 0.16 & 39.4 $\pm$ 0.41 & 158 $\pm$ 3.19 & 625 $\pm$ 2.88 \\
\hline
\end{tabular}

\caption{Computation time (seconds) for vine copula fitting across dimensions $D$ and sample sizes, on a MacBook Pro M1 Pro (2023). The results were averaged over 10 different datasets per dimension/sample-size pair.}
\label{tab:copula_fitting_times}
\end{table}




\section{MODELS}
\label{sec:models}

We provide details of the models evaluated in our paper.

\paragraph{Engression} Engression, proposed in \cite{shen2023engression}, approximates the conditional distribution $Y\mid X$ using a pre-additive noise model $Y = g(WX + \eta) + \beta^\top X$, where $g: \mathbb{R}^d \rightarrow \mathbb{R}$ is a non-linear function that captures non-linear relationships and $\eta = h(\epsilon)$ introduces flexible noise. Built on a neural network architecture that efficiently learns this structure, it optimizes the energy score loss for accurate distributional regression.

\paragraph{Meta-learners}
Meta-learners are flexible frameworks in causal inference designed to estimate individualized treatment effects by leveraging machine learning models. Two common types are T-learners and S-learners. Details can be found in \cite{kunzel2019metalearners}.

T-learners work by training separate models for the treated and untreated groups, predicting outcomes under each treatment condition, and then calculating the difference between these predictions to estimate the treatment effect.
S-learners combine both treated and untreated data into a single model by including treatment as an input feature, allowing the model to learn the outcome function across both treatment conditions simultaneously.
These learners provide a modular approach to estimating conditional average treatment effects (CATE) and can adapt to different settings and model complexities.
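
The difference between the two learners can be sketched in a few lines of Python. Ordinary least squares stands in for the base learner, and the data-generating process, sample size, and helper names (\texttt{fit\_linear}, \texttt{predict\_linear}) are illustrative assumptions rather than the configurations used in our experiments.

```python
import numpy as np


def fit_linear(features, y):
    """Ordinary least squares with an intercept; returns the coefficient vector."""
    design = np.column_stack([np.ones(len(features)), features])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef


def predict_linear(coef, features):
    """Predictions from a coefficient vector produced by fit_linear."""
    design = np.column_stack([np.ones(len(features)), features])
    return design @ coef


rng = np.random.default_rng(2)
n = 5_000
z = rng.normal(1.0, 1.0, size=(n, 2))                    # covariates
x = rng.binomial(1, 0.5, size=n)                         # binary treatment
y = 2.0 * x + 1.0 + 0.5 * z[:, 0] + rng.normal(size=n)   # true CATE is 2 everywhere

# T-learner: one outcome model per treatment arm; CATE = difference of predictions.
coef_treated = fit_linear(z[x == 1], y[x == 1])
coef_control = fit_linear(z[x == 0], y[x == 0])
cate_t = predict_linear(coef_treated, z) - predict_linear(coef_control, z)

# S-learner: a single model with the treatment as an extra input feature.
coef_s = fit_linear(np.column_stack([z, x]), y)
cate_s = (predict_linear(coef_s, np.column_stack([z, np.ones(n)]))
          - predict_linear(coef_s, np.column_stack([z, np.zeros(n)])))
```

With a linear base learner and no interaction terms, the S-learner's CATE estimate is constant in the covariates (it equals the fitted treatment coefficient), whereas the T-learner's estimate can vary with $z$ because the two arms are fitted separately.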
\paragraph{CausalForest}

CausalForest is an extension of random forests designed to estimate heterogeneous treatment effects by partitioning the data into subgroups with similar treatment responses. Introduced by \cite{wager2018estimation}, it uses a tree-based ensemble method to non-parametrically estimate the CATE by building separate models for different covariate regions, while ensuring a balance between treated and control units in each partition. This method is flexible and adapts to complex data structures, making it a powerful tool for understanding treatment effect heterogeneity.

\paragraph{BART} Bayesian Additive Regression Trees, first introduced in  \cite{chipman2010bart}, is a non-parametric machine learning method that uses an ensemble of regression trees to model complex relationships between covariates and outcomes.  The BART model estimates the posterior distribution of the outcome by summing the contributions from many trees, each of which is trained to explain part of the residual error left by the others. This ensemble approach makes BART particularly effective at capturing complex, non-linear relationships between the covariates and the outcome. Unlike standard decision trees, BART applies a Bayesian framework, allowing it to quantify uncertainty in its predictions and avoid overfitting through regularization priors.

\paragraph{TARNet} Treatment-Agnostic Representation Network, first introduced in \cite{johansson2016learning}, is a neural network-based model for estimating heterogeneous treatment effects in causal inference. It works by learning a shared representation of covariates, independent of treatment assignment, and then using this representation to estimate potential outcomes for both the treated and untreated groups. By focusing on treatment-agnostic representation learning, TARNet aims to improve the generalizability and accuracy of treatment effect estimates, particularly in high-dimensional settings.

\section{COMPUTATION DETAILS}
\label{sec:computation_details}
We provide computation details for the experiments in \Cref{sec:experiments}. We use the default recommended hyperparameters for each model.

\begin{table}[h]
\caption{Hyperparameters of Each Model.} 
\label{tab:hyperparameter}
\begin{center}
\begin{tabular}{l|p{6cm}|p{5cm}}
\toprule
\textbf{Model} & \textbf{Key Hyperparameters} & \textbf{Package} \\
\midrule
TARNet & \begin{minipage}{5.5cm} number of layers = 2\\
batch size = 64\\
learning rate = 0.0001\\
number of epochs = 2000 \end{minipage} & \begin{minipage}{5cm} Python\\ \texttt{catenets} \citep{curth2021really} \end{minipage}\\
\cmidrule{1-3}
CausalForest & \begin{minipage}{5.5cm} number of trees = 100\\
maximum depth = 3
\end{minipage} & \begin{minipage}{5.5cm}
Python, \texttt{econml}\\ \citep{econml} \end{minipage}\\
\cmidrule{1-3}
S-/T-BART & \begin{minipage}{5.5cm} number of trees = 75\\ number of iterations = 4\\ 
number of burn-in iterations = 200\\ posterior draws = 800
\end{minipage} & R, \texttt{dbarts} \citep{dbarts} \\
\cmidrule{1-3}
S-/T-engression & \begin{minipage}{5.5cm} number of layers = 3\\ 
batch size = 64\\ 
learning rate = 0.01 \\
number of epochs = 500 \end{minipage} & \begin{minipage}{5cm}
Python, \texttt{engression}\\ \citep{engression}\end{minipage}\\
\bottomrule
\end{tabular}
\end{center}
\end{table}

All experiments were conducted on a MacBook with an Apple M3 chip, 8-core CPU, and 32GB RAM. 


\section{ADDITIONAL EXPERIMENTS}
\label{sec:additional_exp}
\subsection{IN-DOMAIN MODEL PERFORMANCE TESTING ON THE IHDP DATASET}
\label{sec:indomain}
Although our proposed method mainly tackles out-of-domain generalizability assessment, which is a challenging task as demonstrated in \Cref{sec:generalizability_in_causal_inference}, it can be easily adapted to performance evaluation for in-domain tasks. As an illustration, \Cref{fig:ihdp_indomain} presents the in-domain test results and MSE for the IHDP dataset, using the same experimental setup as in \Cref{sec:IHDP} but without introducing any domain shift (i.e.~$Z_1$ remains unchanged in the test domain).

\begin{figure}[H]
\vspace{.3in}
\centerline{\includegraphics[width=0.8\linewidth]{indomain.png}}
\vspace{.3in}
\caption{$\log_{10}(\text{MSE})$ and $\log_{10}(\text{p-value})$ of Mean Regression Testing on the IHDP Dataset, No Domain Shift.}
\label{fig:ihdp_indomain}
\end{figure}

\Cref{fig:ihdp_indomain} contrasts the $\log_{10}(\text{MSE})$ and $\log_{10}(\text{p-value})$ assessment results. Each model was trained and evaluated with its default hyperparameters, so the test results reflect each model's generalizability under those settings. As expected, the MSE and test results align: TARNet (with default hyperparameter settings) exhibits large MSE, and its p-values are generally very small. Meanwhile, S-engression and T-engression yield comparatively lower MSEs; however, MSE alone can be insufficiently persuasive. By incorporating p-values and the corresponding statistical guarantees offered by our method, we can make stronger assertions about the generalizability of these two engression approaches. These findings emphasize the usefulness of our proposed method in model assessment, as discussed at the end of \Cref{sec:generalizability_in_causal_inference}.

\subsection{TESTING GENERALIZABLE MODELS}
\label{sec:linear}
We include an additional experiment in this section, which is based on the synthetic data setting in \Cref{sec:synthetic}, but without domain shift. We set the marginal distribution of $Z_1$, $Z_2$ to be $\mathcal{N}(1,1)$, and $Y(X) \sim \mathcal{N}(2X+1,1)$, $X\sim \operatorname{Bernoulli} (0.5)$. In this case, the conditional average treatment effect should be linear. 

The result when there is no domain shift can be found in \Cref{fig:synthetic_mean_p_noshift}. We see that the p-values of both S-Linear (Regression) and T-Linear (Regression) are uniformly distributed. Given the true CATE function is indeed linear, this result validates our proposed method.


\begin{figure}[H]
\vspace{.3in}
\centerline{\includegraphics[width=0.5\linewidth]{synthetic_mean_p_noshift.png}}
\vspace{.3in}
\caption{$p$-values of Mean Regression Testing, Synthetic Data of 50 Iterations, No Domain Shift.}
\label{fig:synthetic_mean_p_noshift}
\end{figure}

We next test under domain shift: we keep all training-set settings as above, but change the marginal distribution of $Z_1$, $Z_2$ in the test set to $\mathcal{N}(3,2)$. \Cref{fig:synthetic_mean_p_shift} shows the results. Linear regressions still demonstrate good generalizability. However, for algorithms such as S-engression and S-BART the results worsen, likely due to problems such as overfitting.

\begin{figure}[H]
\vspace{.3in}
\centerline{\includegraphics[width=0.5\linewidth]{synthetic_mean_p_shift_linear.png}}
\vspace{.3in}
\caption{$p$-values of Mean Regression Testing, Synthetic Data of 50 Iterations, with Domain Shift.}
\label{fig:synthetic_mean_p_shift}
\end{figure}

\subsection{MORE COMPLICATED DATA GENERATION}
\label{sec:complicated_exp}
To demonstrate the flexibility of our approach, we run additional experiments across different data generation settings, including increasing the number of covariates, changing the marginal distributions, and changing the dependency structures.
 
Tables \ref{tab:50_covariates} and \ref{tab:100_covariates} show the $\log_{10}(\text{p-value})$ statistics from 50 trials conducted under a setting similar to Synthetic Setting 1 in the main body of our paper, with only two changes: (1) we increase the number of covariates from 2 to 50 and 100, and keep the covariate distribution shifts the same for each covariate; (2) we replace the dependency structures with randomly sampled correlation matrices. CausalForest, T-engression and T-BART demonstrate good generalizability in these settings.

\begin{table}[H]
\centering
\caption{$\log_{10}(p\text{-values})$ Statistics under Synthetic Setting 1 with 50 Covariates.}
\begin{tabular}{rrrrrrr} 
\toprule 
\textbf{Model} & \textbf{Min} & \textbf{25\%} & \textbf{Median} & \textbf{Mean} & \textbf{75\%} & \textbf{Max} \\
\midrule 
TARNet & $-36.8$ & $-33.6$ & $-32.4$ & $-31.6$ & $-31.5$ & $-30.8$ \\ 
CausalForest & $-2.35$ & $-1.22$ & $-0.851$ & $-0.539$ & $-0.214$ & $-0.077$ \\ 
S-BART & $-8.32$ & $-4.22$ & $-3.36$ & $-2.58$ & $-2.66$ & $-1.92$ \\ 
T-BART & $-1.52$ & $-0.757$ & $-0.326$ & $-0.349$ & $-0.187$ & $-0.044$ \\ 
S-engression & $-20.0$ & $-18.2$ & $-17.3$ & $-14.4$ & $-16.7$ & $-13.1$ \\ 
T-engression & $-2.53$ & $-0.669$ & $-0.283$ & $-0.324$ & $-0.211$ & $-0.006$ \\ 
\bottomrule
\end{tabular}
\label{tab:50_covariates}
\end{table}

\begin{table}[H]
\centering
\caption{$\log_{10}(p\text{-values})$ Statistics under Synthetic Setting 1 with 100 Covariates.}
\begin{tabular}{rrrrrrr} \toprule 
\textbf{Model} & \textbf{Min} & \textbf{25\%} & \textbf{Median} & \textbf{Mean} & \textbf{75\%} & \textbf{Max} \\
\midrule 
 TARNet & $-34.3$ & $-31.3$ & $-30.9$ & $-29.9$ & $-30.3$ & $-29.0$ \\ 
 CausalForest & $-2.39$ & $-1.35$ & $-0.821$ & $-0.670$ & $-0.479$ & $-0.122$ \\ 
 S-BART & $-9.62$ & $-7.32$ & $-6.68$ & $-5.22$ & $-5.99$ & $-3.96$ \\ 
 T-BART & $-1.04$ & $-0.76$ & $-0.36$ & $-0.31$ & $-0.12$ & $0.00$ \\ 
 S-engression & $-28.6$ & $-26.1$ & $-25.3$ & $-23.6$ & $-24.0$ & $-22.6$ \\ 
 T-engression & $-2.46$ & $-0.663$ & $-0.393$ & $-0.366$ & $-0.137$ & $-0.107$ \\ 
 \bottomrule
 \end{tabular}
\label{tab:100_covariates}
\end{table}


\Cref{tab:non-linear} presents the $\log_{10}(p\text{-values})$ statistics from 50 trials under the same setup as the first experiment in \Cref{sec:linear}, except that the marginal causal distribution is changed from Gaussian to gamma, which introduces non-linear dependencies in the conditional causal margin. While linear regression was generalizable in the original setup, it fails in the non-linear setting, demonstrating that our approach can detect methods that fail to generalize well.

\begin{table}[H]
\centering
\caption{$\log_{10}(p\text{-values})$ Statistics under the Same Set-up as \Cref{fig:synthetic_mean_p_noshift} with Non-linear Dependency. }
\begin{tabular}{rrrrrrr} 
\toprule 
\textbf{Model} & \textbf{Min} & \textbf{25\%} & \textbf{Median} & \textbf{Mean} & \textbf{75\%} & \textbf{Max} \\
\midrule 
S-Linear & $-16.2$ & $-13.0$ & $-11.7$ & $-10.9$ & $-10.8$ & $-9.93$ \\ 
T-Linear & $-13.3$ & $-11.2$ & $-10.6$ & $-9.67$ & $-10.0$ & $-8.61$ \\ 
TARNet & $-24.0$ & $-21.7$ & $-21.2$ & $-19.5$ &$ -19.9$ & $-18.6$ \\ 
CausalForest & $-12.2$ & $-10.8$ & $-10.2$ & $-9.16$ & $-9.34$ & $-8.23$ \\ 
S-BART & $-13.4$ & $-10.3$ & $-9.33$ & $-7.60$ & $-8.50$ & $-6.36$ \\ 
T-BART & $-11.6$ & $-8.40$ & $-7.84$ & $-6.38$ & $-7.49$ & $-5.12$ \\ 
S-engression & $-12.5$ & $-9.89$ & $-9.45$ & $-8.01$ & $-8.17$ & $-7.13$ \\ 
T-engression & $-9.69$ & $-7.32$ & $-6.83$ & $-5.23$ & $-6.32$ & $-3.99$ \\ 
\bottomrule
\end{tabular}
\label{tab:pvalue_statistics}

\label{tab:non-linear}
\end{table}

A strength of our framework is that vine copulas allow users to test their methods against various classes of copulas. We demonstrate this in \Cref{tab:non-gaussian_copula} with the following data generating process:

\begin{itemize}
    \item Training Domain: Covariates' marginal distributions are identical gamma distributions with shape $k=8$ and rate $\theta=4$;
    \item Testing Domain: Covariates' marginal distributions are identical gamma distributions with shape $k=2$ and rate $\theta=1$;
    \item Marginal Causal Distribution: Modelled as an exponential distribution with $k=0.5x+0.1$;
    \item Treatment Assignment: Specified as $ X\sim \operatorname{Bernoulli} (0.5)$;
    \item Copula: Randomly sampled R-vine structure, with each bivariate copula set to be a Clayton copula \citep{kreinovich2013clayton} with a parameter of 2. 
\end{itemize}

\begin{table}[H]
\caption{$\log_{10}(p\text{-values})$ Statistics for Experiment with a Non-Gaussian Copula. }
\centering
\begin{tabular}{rrrrrrr} 
\toprule 
\textbf{Model} & \textbf{Min} & \textbf{25\%} & \textbf{Median} & \textbf{Mean} & \textbf{75\%} & \textbf{Max} \\ 
\midrule 
S-Linear & $-\infty$ & $-5.71$ & $-4.92$ & $-3.89$ & $-4.15$ & $-2.81$ \\
T-Linear & $-\infty$ & $-3.47$ & $-2.64$ & $-1.87$ & $-1.94$ & $-0.929$ \\
TARNet & $-18.3$ & $-15.9$ & $-14.5$ & $-12.5$ & $-13.7$ & $-11.2$ \\
CausalForest & $-10.9$ & $-3.53$ & $-2.81$ & $-2.35$ & $-2.23$ & $-1.49$ \\
S-BART & $-\infty$ & $-4.12$ & $-3.47$ & $-2.91$ & $-2.99$ & $-2.04$ \\
T-BART & $-\infty$ & $-4.14$ & $-3.24$ & $-2.62$ & $-2.62$ & $-1.73$ \\
S-engression & $-15.8$ & $-4.06$ & $-3.04$ & $-2.35$ & $-2.63$ & $-1.54$ \\
T-engression & $-10.2$ & $-3.70$ & $-2.28$ & $-1.95$ & $-1.70$ & $-1.40$ \\
\bottomrule
\end{tabular}

\label{tab:non-gaussian_copula}
\end{table}


\Cref{tab:gaussian_copula} shows the $\log_{10}(p\text{-values})$ from testing generalizability with data generated from a Gaussian copula. The covariate margins, the causal margins, the dependency structure, and the second moments of each bivariate copula are identical to the previous example. We choose the rank correlation coefficient of the Gaussian copula as $\rho = \frac{\theta}{2+\theta}$, where $\theta$ parameterizes the Clayton copula; this was set to 2 in the previous example. The only difference between the two processes is the copula family. The $-\infty$ entries in Tables \ref{tab:non-gaussian_copula} and \ref{tab:gaussian_copula} arise because the original $p$-values are 0.
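
For concreteness, the expression $\theta/(2+\theta)$ coincides with Kendall's $\tau$ of a bivariate Clayton copula. The sketch below evaluates it at $\theta = 2$ and, under the assumption that the rank correlation is read as Kendall's $\tau$, also computes the Gaussian copula correlation parameter with the same $\tau$ via the standard relation $\tau = (2/\pi)\arcsin(r)$. Whether the experiments use $\tau$ itself or this transformed value as the Gaussian parameter is not specified here, so the second function is illustrative only; both function names are hypothetical.

```python
import math


def clayton_kendall_tau(theta):
    """Kendall's tau of a bivariate Clayton copula with parameter theta > 0."""
    return theta / (theta + 2.0)


def gaussian_param_from_tau(tau):
    """Gaussian copula correlation parameter whose Kendall's tau equals tau."""
    return math.sin(math.pi * tau / 2.0)


tau = clayton_kendall_tau(2.0)          # rank correlation implied by theta = 2
r_gauss = gaussian_param_from_tau(tau)  # Gaussian parameter with the same tau
```

With $\theta = 2$ this gives $\tau = 0.5$, and a Gaussian copula parameter of $\sin(\pi/4) \approx 0.707$ under the Kendall's $\tau$ matching assumed above.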

\begin{table}[H]
\centering
\caption{$\log_{10}(p\text{-values})$ for Experiment with the Same Setting as in \Cref{tab:non-gaussian_copula}, but with a Gaussian Copula.}
\begin{tabular}{rrrrrrr} 
\toprule 
\textbf{Model} & \textbf{Min} & \textbf{25\%} & \textbf{Median} & \textbf{Mean} & \textbf{75\%} & \textbf{Max} \\
\midrule 
S-Linear & $-\infty$ & $-5.19$ & $-4.68$ & $-3.52$ & $-3.49$ & $-2.73$ \\
T-Linear & $-\infty$ & $-2.59$ & $-1.83$ & $-1.42$ & $-1.45$ & $-0.512$ \\
TARNet & $-21.3$ & $-15.6$ & $-14.3$ & $-13.5$ & $-13.6$ & $-12.7$ \\
CausalForest & $-5.45$ & $-3.81$ & $-2.94$ & $-1.70$ & $-2.01$ & $-0.517$ \\
S-BART & $-\infty$ & $-3.48$ & $-2.85$ & $-2.13$ & $-2.09$ & $-1.20$ \\
T-BART & $-\infty$ & $-2.85$ & $-2.17$ & $-1.74$ & $-1.44$ & $-1.07$ \\
S-engression & $-11.4$ & $-3.16$ & $-2.62$ & $-1.82$ & $-1.68$ & $-1.10$ \\
T-engression & $-9.24$ & $-2.61$ & $-1.51$ & $-1.08$ & $-0.921$ & $-0.322$ \\
\bottomrule
\end{tabular}

\label{tab:gaussian_copula}
\end{table}

Contrasting Tables \ref{tab:non-gaussian_copula} and \ref{tab:gaussian_copula} shows that model generalizability is sensitive to the copula family. Therefore, the flexibility of simulating data from different copula families, which is a key advantage of our current parametric framework, is important for evaluating model generalizability. We also emphasize that, although in this paper we simulate from frugal models parametrically, there are methods that can flexibly model copulas without parametric assumptions \citep{de2024marginal}, and others may not require copulas at all.

\subsection{EQUIVALENCE TESTING}
\label{sec:equiv_testing}
Our framework is flexible and naturally accommodates equivalence testing. Equivalence testing can be restrictive because it requires an additional hyperparameter, the equivalence margin $\delta$, which can influence test outcomes. However, in certain applications, such as those requiring guarantees about not overlooking non-generalizable models, equivalence testing (e.g.~TOST: two one-sided tests) can be more appropriate. Here, the null hypothesis becomes $H_0$: $|\hat{\tau}^B-\tau^B|\geq \delta$, and the Type I error corresponds to the risk of falsely concluding that the model generalizes. We provide additional experimental results on the synthetic datasets.

On synthetic data, we report TOST results for two margins, $\delta=0.1$ and $\delta=0.2$, using the same bootstrap configuration
as in \Cref{alg:mean_test_algo} ($N_A=200$, $N_B=50$, $N_{btp}=200$, 50 repetitions).
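
The logic of TOST can be sketched as follows. Note that this sketch swaps the bootstrap procedure for a plain normal approximation with a known standard error, purely for illustration; the numerical inputs and the function name \texttt{tost\_p\_value} are hypothetical.

```python
from scipy import stats


def tost_p_value(estimate, truth, std_error, delta):
    """TOST p-value for H0: |estimate - truth| >= delta (normal approximation)."""
    diff = estimate - truth
    # First one-sided null, diff <= -delta: rejected when diff is well above -delta.
    p_lower = stats.norm.sf((diff + delta) / std_error)
    # Second one-sided null, diff >= +delta: rejected when diff is well below +delta.
    p_upper = stats.norm.cdf((diff - delta) / std_error)
    # Equivalence is concluded only if both one-sided nulls are rejected.
    return max(p_lower, p_upper)


p_tight = tost_p_value(estimate=2.03, truth=2.0, std_error=0.04, delta=0.1)
p_wide = tost_p_value(estimate=2.03, truth=2.0, std_error=0.04, delta=0.2)
p_biased = tost_p_value(estimate=2.60, truth=2.0, std_error=0.04, delta=0.2)
```

For the same estimate, the wider margin yields the smaller $p$-value, while an estimate far outside the margin gives a $p$-value close to 1, so a badly biased model is never declared equivalent.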



% Synthetic setting 1, δ = 0.1
\begin{table}[ht]
\centering
\caption{Synthetic setting 1, $\delta = 0.1$ (TOST $p$-values)}\label{tab:tost_syn1_d01}
\begin{tabular}{lcccc}
\toprule
\textbf{Model} & \textbf{Min} & \textbf{Median} & \textbf{Mean} & \textbf{Max}\\
\midrule
TARNet        & 1.00 & 1.00 & 1.00 & 1.00 \\
CausalForest  & $1.00\times10^{-6}$ & $1.75\times10^{-4}$ & $8.49\times10^{-3}$ & 0.119 \\
S-BART       & $4.00\times10^{-6}$ & $3.88\times10^{-4}$ & $9.04\times10^{-3}$ & 0.126 \\
T-BART       & $1.83\times10^{-4}$ & $4.86\times10^{-3}$ & $3.53\times10^{-2}$ & 0.381 \\
S-engression & $3.72\times10^{-4}$ & $8.13\times10^{-2}$ & 0.133 & 0.717 \\
T-engression & $7.70\times10^{-4}$ & $2.31\times10^{-2}$ & $3.93\times10^{-2}$ & 0.226 \\
\bottomrule
\end{tabular}
\label{tab:synthetic1_d01_equiv}
\end{table}

% Synthetic setting 1, δ = 0.2
\begin{table}[ht]
\centering
\caption{Synthetic setting 1, $\delta = 0.2$ (TOST $p$-values)}\label{tab:tost_syn1_d02}
\begin{tabular}{lcccc}
\toprule
\textbf{Model} & \textbf{Min} & \textbf{Median} & \textbf{Mean} & \textbf{Max}\\
\midrule
TARNet        & 1.00 & 1.00 & 1.00 & 1.00 \\
CausalForest  & $5.55\times10^{-16}$ & $1.20\times10^{-10}$ & $5.10\times10^{-9}$ & $1.51\times10^{-7}$ \\
S-BART       & $1.65\times10^{-14}$ & $6.18\times10^{-11}$ & $8.02\times10^{-8}$ & $1.60\times10^{-6}$ \\
T-BART       & $2.57\times10^{-10}$ & $6.45\times10^{-8}$ & $6.25\times10^{-5}$ & $1.96\times10^{-3}$ \\
S-engression & $1.66\times10^{-11}$ & $3.31\times10^{-6}$ & $1.26\times10^{-4}$ & $2.87\times10^{-3}$ \\
T-engression & $2.76\times10^{-9}$  & $1.73\times10^{-5}$ & $2.64\times10^{-4}$ & $2.80\times10^{-3}$ \\
\bottomrule
\end{tabular}
\label{tab:synthetic1_d02_equiv}
\end{table}

Setting $\delta = 0.2$ yields a wider equivalence region than $\delta = 0.1$, i.e.~a wider range of discrepancy is deemed acceptable. As a result, we expect higher rejection rates (or smaller $p$-values) when $\delta = 0.2$, since the criterion for equivalence is more lenient. Conversely, with $\delta = 0.1$, the test is stricter, and $p$-values are generally larger. This is confirmed by comparing the results in Tables \ref{tab:synthetic1_d01_equiv} and \ref{tab:synthetic1_d02_equiv}.

% Synthetic setting 2, δ = 0.1
\begin{table}[ht]
\centering
\caption{Synthetic setting 2, $\delta = 0.1$ (TOST p-values)}\label{tab:tost_syn2_d01}
\begin{tabular}{lcccc}
\toprule
\textbf{Model} & \textbf{Min} & \textbf{Median} & \textbf{Mean} & \textbf{Max}\\
\midrule
TARNet        & 1.00 & 1.00 & 1.00 & 1.00 \\
CausalForest  & $4.00\times10^{-6}$ & $1.60\times10^{-3}$ & $1.99\times10^{-2}$ & 0.336 \\
S-BART       & $1.00\times10^{-5}$ & $2.21\times10^{-3}$ & $3.08\times10^{-2}$ & 0.391 \\
T-BART       & $2.07\times10^{-4}$ & $4.54\times10^{-3}$ & $2.92\times10^{-2}$ & 0.282 \\
S-engression & $9.38\times10^{-4}$ & 0.108 & 0.191 & 0.745 \\
T-engression & $2.30\times10^{-2}$ & 0.109 & 0.162 & 0.381 \\
\bottomrule
\end{tabular}
\label{tab:synthetic2_d01_equiv}
\end{table}

% Synthetic setting 2, δ = 0.2
\begin{table}[ht]
\centering
\caption{Synthetic setting 2, $\delta = 0.2$ (TOST p-values)}\label{tab:tost_syn2_d02}
\begin{tabular}{lcccc}
\toprule
\textbf{Model} & \textbf{Min} & \textbf{Median} & \textbf{Mean} & \textbf{Max}\\
\midrule
TARNet        & 1.00 & 1.00 & 1.00 & 1.00 \\
CausalForest  & $3.64\times10^{-12}$ & $2.64\times10^{-10}$ & $1.08\times10^{-7}$ & $2.00\times10^{-6}$ \\
S-BART       & $4.10\times10^{-6}$  & $8.02\times10^{-3}$  & $4.32\times10^{-2}$ & 0.275 \\
T-BART       & $1.02\times10^{-3}$  & $2.67\times10^{-2}$  & $5.43\times10^{-2}$ & 0.329 \\
S-engression & $3.03\times10^{-6}$  & $1.36\times10^{-3}$  & $2.26\times10^{-2}$ & 0.222 \\
T-engression & $1.85\times10^{-3}$  & $4.77\times10^{-2}$  & $6.37\times10^{-2}$ & 0.214 \\
\bottomrule
\end{tabular}
\label{tab:synthetic2_d02_equiv}
\end{table}

Setting 2 involves larger domain shifts than Setting 1, which makes generalization more challenging. As expected, Tables \ref{tab:synthetic2_d01_equiv} and \ref{tab:synthetic2_d02_equiv} show generally higher p-values under Setting 2, reflecting the greater difficulty of rejecting the null hypothesis of non-equivalence (that is, non-generalizability).


We also ran equivalence testing under the same setting as in \Cref{fig:synthetic_mean_p_noshift}, where linear regression models are expected to exhibit perfect transportability. With $\delta = 0.1$, we obtain the results in \Cref{tab:linear_equiv}. These are as expected: the models should reject the null hypothesis of non-equivalence in this setting, which confirms their strong transportability under equivalence testing.


% Linear-regression transportability (synthetic, δ = 0.1)
\begin{table}[ht]
\centering
\caption{Linear-regression transportability, synthetic ($\delta = 0.1$)}\label{tab:tost_linreg}
\begin{tabular}{lcccc}
\toprule
\textbf{Model} & \textbf{Min} & \textbf{Median} & \textbf{Mean} & \textbf{Max}\\
\midrule
S-Linear & $1.47\times10^{-13}$ & $1.47\times10^{-10}$ & $1.28\times10^{-8}$ & $2.17\times10^{-7}$ \\
T-Linear & $1.54\times10^{-11}$ & $1.10\times10^{-8}$  & $4.57\times10^{-7}$ & $9.10\times10^{-6}$ \\
\bottomrule
\end{tabular}
\label{tab:linear_equiv}
\end{table}



% The results in \Cref{tab:non-gaussian_copula} reveal differing test outcomes between the datasets, despite the only difference being the choice of copula family. These experiments serve two purposes. Firstly, they  show that a variety of different copula families can be chosen and fitted to data. Secondly, we demonstrate our framework allows for nuanced sensitivity analyses across subtly different dependencies, which is particularly important if the underlying data is heavy tailed and not appropriately modelled by a Gaussian copula.
% We illustrate the benefits of such flexibility by conducting a sensitivity analysis of the algorithms tested in this paper to different copula families. The marginal densities and the copula tree structure are kept the same. We simulated data using a randomly sampled R-vine structure for with five covariates. We set each bivariate copula to a Clayton copula with a parameter of 2, highlighting its asymmetric tail dependency, which a Gaussian copula cannot capture well~\citep{kreinovich2013clayton}.

\section{INTERPRETING TESTING RESULTS}
\label{sec:read_p}
We further explain the motivation for our paper and provide guidance on reading the test results.

The p-values and their distributions are highly informative in evaluating generalizability. For example, consistently small p-values (as shown in \Cref{fig:ihdp_mean}) indicate a clear failure of model generalizability in that setting. Conversely, uniformly distributed p-values (e.g.~the linear regression results in \Cref{fig:synthetic_mean_p_noshift}) support greater trust in the model's generalizability. Type-I error control plays a critical role in distinguishing between competing hypotheses with a controlled probability of error. In our framework, controlling the Type-I error ensures that, when a model fails the test, the conclusion of non-generalization is unlikely to be driven by random noise. This rigour is crucial for causal inference, where decisions based on incorrect conclusions can have significant consequences. In contrast, predictive performance measures such as MSE lack statistical safeguards, so interpretations of model performance under domain shifts that rely on them alone lack reliability and robustness.
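
The link between uniformly distributed p-values and Type-I error control can be stated briefly. This is a standard fact for a test with a continuous test statistic under a true point null, given here under those assumptions rather than derived for our specific test:
\begin{equation*}
    p \sim \mathrm{Uniform}(0,1) \ \text{under } H_0
    \quad \Longrightarrow \quad
    \mathbb{P}_{H_0}(p \le \alpha) = \alpha \qquad \text{for all } \alpha \in [0,1].
\end{equation*}
Hence, when a model generalizes, roughly a fraction $\alpha$ of repeated tests reject at level $\alpha$, whereas p-values concentrated near zero correspond to rejections far more frequent than chance would allow.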


We also explain how to interpret the case in which every model fails the test. As with any hypothesis test, a rejection provides evidence against the tested hypothesis. In our framework, this means the algorithm lacks sufficient generalizability to infer the conditional treatment margin in new domains. If all algorithms fail, none of them is suitable for reliable causal inference under the domain shift.

This highlights the need for alternative modelling approaches and underscores the value of our framework. Unlike MSE, which compares predictive performance, our method directly identifies failures in causal generalizability, an essential insight for researchers. We hope this clarifies how to interpret such results and guides researchers in determining next steps when all models fail.

\section{COMPARISON WITH SCORES}
\label{sec:compare_with_MSE}

Our testing framework is actionable in that it delivers a
principled, binary decision on whether a model is
\emph{generalizable} to a given domain.  This is essential for model
selection: rather than relying on metrics such as mean squared error (MSE) alone (which may
favour non-generalizable models), we first use our test to \emph{filter
out} models that fail to generalize.


Our method structures selection into two stages:
\begin{description}
  \item[\textbf{Stage~1:}] Apply the proposed testing procedure to
        identify models that generalize across domains.
  \item[\textbf{Stage~2:}] Among the models that pass the test,
        rank them with a predictive metric (such as MSE) and pick the
        best-performing one.
\end{description}
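
The two stages can be summarized compactly. The notation below (candidate set $\mathcal{M}$, a single p-value $p_m$ from our test for each model $m$, and significance level $\alpha$) is introduced here only for illustration:
\begin{equation*}
    \mathcal{M}_{\alpha} = \{\, m \in \mathcal{M} : p_m > \alpha \,\},
    \qquad
    \hat{m} = \operatorname*{arg\,min}_{m \in \mathcal{M}_{\alpha}} \mathrm{MSE}(m).
\end{equation*}
Here a model passes Stage~1 when the null hypothesis of generalizability is not rejected at level $\alpha$. If $\mathcal{M}_{\alpha}$ is empty, no model is selected.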

This two-stage approach ensures selection is both statistically sound
and practically robust: it prioritizes generalizability before
performance.  In this framework we choose models that are
``good and generalizable,'' not merely ``relatively good'' by the score 
alone.  MSE is actionable only in the sense that it lets one compare
already viable models and hyper-parameter settings.



We do not argue against continuous scores; instead, we view them and our test as
complementary.  Our test provides statistical guarantees on
generalizability.  Once generalizable models are identified, MSE can rank
their relative predictive performance.  Relying on a continuous score alone is
insufficient: a model may achieve a comparatively low score in one domain, yet fail to
generalize elsewhere.

To highlight the discrepancy, we re-ran the experiments in the paper and
recorded the mean squared error across 50 trials.  Tables~\ref{tab:syn1},
\ref{tab:syn2}, and~\ref{tab:real} show the minimum and maximum MSE
values and the corresponding performance ranks (lower is better) for
each model.


\begin{table}[ht]
  \centering
  \caption{Synthetic Setting~1: MSE statistics over 50 trials}
  \label{tab:syn1}
  \begin{tabular}{lcccc}
    \toprule
    \textbf{Model} & \textbf{Min} & \textbf{Max} & \textbf{Min Rank} & \textbf{Max Rank}\\
    \midrule
    TARNet         & 2.40 & 2.68 & 6 & 6 \\
    CausalForest   & 0.001 & 0.041 & 1 & 4 \\
    S-BART          & 0.040 & 0.067 & 1 & 4 \\
    T-BART          & 0.004 & 0.130 & 1 & 5 \\
    S-engression   & 0.016 & 0.144 & 2 & 5 \\
    T-engression   & 0.006 & 0.080 & 1 & 5 \\
    \bottomrule
  \end{tabular}
\end{table}

\begin{table}[ht]
  \centering
  \caption{Synthetic Setting~2: MSE statistics over 50 trials}
  \label{tab:syn2}
  \begin{tabular}{lcccc}
    \toprule
    \textbf{Model} & \textbf{Min} & \textbf{Max} & \textbf{Min Rank} & \textbf{Max Rank}\\
    \midrule
    TARNet         & 2.09  & 2.92  & 5 & 6 \\
    CausalForest   & 0.012 & 0.230 & 1 & 3 \\
    S-BART          & 0.040 & 0.150 & 2 & 5 \\
    T-BART          & 0.030 & 0.200 & 1 & 3 \\
    S-engression   & 0.120 & 0.600 & 4 & 6 \\
    T-engression   & 0.020 & 0.180 & 1 & 4 \\
    \bottomrule
  \end{tabular}
\end{table}

\begin{table}[ht]
  \centering
  \caption{IHDP: MSE statistics over 50 trials}
  \label{tab:real}
  \begin{tabular}{lcccc}
    \toprule
    \textbf{Model} & \textbf{Min} & \textbf{Max} & \textbf{Min Rank} & \textbf{Max Rank}\\
    \midrule
    TARNet         & 10.2 & 86.0 & 6 & 6 \\
    CausalForest   & 0.03  & 6.19  & 1 & 5 \\
    S-BART          & 0.03  & 6.72  & 1 & 4 \\
    T-BART          & 0.02  & 6.25  & 1 & 3 \\
    S-engression   & 0.10  & 10.16 & 2 & 5 \\
    T-engression   & 0.05  & 6.45  & 1 & 5 \\
    \bottomrule
  \end{tabular}
\end{table}

MSE and rank summaries provide no statistical confidence that a
model generalizes.  For example, S-BART's minimum MSE in Synthetic Setting~2
is $0.04$, far smaller than TARNet's ($2.09$), yet this does not establish that S-BART
generalizes.  In contrast, \Cref{fig:synthetic_mean_p} shows small $p$-values for
S-BART, letting us reject the null hypothesis of generalizability at the 5\% level.

Another limitation of MSE is that, in heterogeneous or endogenous noise
settings, cross-domain MSEs may diverge even with a perfectly specified
CATE model.  Differing noise levels alone can create apparent
performance gaps.
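
As a simple illustration of this point, consider an additive-noise outcome model with MSE measured against observed outcomes. The notation below ($\mu$, $\varepsilon$, $\sigma_k$, and domains indexed by $k$) is assumed for this sketch and is not taken from our experimental setup. Suppose that in domain $k$ the outcome satisfies $Y = \mu(X,T) + \varepsilon$ with $\mathbb{E}[\varepsilon \mid X, T] = 0$ and $\mathrm{Var}(\varepsilon) = \sigma_k^2$. Then, for any fixed predictor $\hat\mu$,
\begin{equation*}
    \mathrm{MSE}_k(\hat\mu)
    = \mathbb{E}_k\!\left[(Y - \hat\mu(X,T))^2\right]
    = \mathbb{E}_k\!\left[(\mu(X,T) - \hat\mu(X,T))^2\right] + \sigma_k^2 .
\end{equation*}
A perfectly specified model $\hat\mu = \mu$ therefore still has $\mathrm{MSE}_k = \sigma_k^2$, so its MSE differs between two domains whenever $\sigma_1 \neq \sigma_2$.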
