

In this section, we evaluate our test and the resulting bias lower bound in finite-sample semi-synthetic experiments.

\subsection{Experimental setting}

\paragraph{Dataset} We evaluate our testing procedure on a semi-synthetic dataset derived from a real-world randomized trial: Hillstrom's MineThatData Email dataset \citep{hillstrom2008}. The Hillstrom dataset contains records of 64,000 customers who made purchases online within the last twelve months. We consider a combined treatment group, which constitutes approximately 66\% of the dataset, and a control group. The outcome is the amount in dollars spent in the two weeks following the campaign. The available features include each customer's annual spending, newcomer status, and geographical location, among others. We normalize continuous features and one-hot encode categorical features, resulting in a 13-dimensional feature vector. By default, we use 80\% of the full dataset as the observational study ($\obs$) and the remaining 20\% as the randomized trial ($\rct$).

\paragraph{Bias model} We consider three different models for the bias between studies, given by $\truebias(x)=\cateobs(x)-\estimandobs(x)$, for all $x \in \XX$. In Scenario 1, we consider one subgroup with a constant bias of $\truebias=60$, while the rest of $\obs$ remains unbiased. In Scenario 2 (\Cref{fig:heatmap_scenario2}), we add biases of varying magnitudes across 12 subgroups defined by combinations of the binary features \texttt{newbie} and \texttt{mens} and the categorical feature \texttt{channel}. The largest bias is $\truebias=60$, and it affects only 12\% of the observational dataset.
The subgroup biases roughly cancel out, so the average bias is close to zero, i.e., $\EE_{\pobs}[\truebias(X)] \approx 0$. Finally, in Scenario 3 (\Cref{fig:heatmap_scenario3}), we model the bias as a quadratic polynomial of the feature \texttt{history}, with coefficients sampled separately for each of the two values of \texttt{newbie}.
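
As an illustration, Scenarios 1 and 3 can be sketched in symbols. The notation below is introduced only for this sketch and is not taken from the experimental setup: $S \subset \XX$ denotes the biased subgroup of Scenario 1, $h$ and $n \in \{0,1\}$ denote the values of \texttt{history} and \texttt{newbie} in $x$, and $a_n, b_n, c_n$ are the sampled coefficients. Under these assumptions, Scenario 1 corresponds to
\[
  \truebias(x) = 60 \cdot \mathbf{1}\{x \in S\},
\]
and Scenario 3 to
\[
  \truebias(x) = a_{n} + b_{n}\, h + c_{n}\, h^{2}.
\]
The specific subgroup $S$ and the sampled coefficient values are not fixed by this sketch.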



\paragraph{User-defined tolerance and baselines} 
We refer to the testing function proposed in this paper as $\catetest$, and we instantiate it using constant upper and lower bounds for the tolerance function, as described in Example 1 from~\Cref{sec:hte}~($\estimandobs_{\pm}(X) = \estimandobs(X) \pm \delta$ for some constant $\delta \in \RR^+$). We compare our test against $\atetest$, which is a slight modification\footnote{$\atetest$ is a t-test for the null hypothesis that the average treatment effects of the two studies differ by at most $\delta$.} of the test with tolerance proposed in~\citep{de2023hidden}. For both tests, we can compute the lower bound on the bias $\deltalb$, as defined in~\Cref{eq:deltalb}. Note that while our method allows us to select a subset of features $\subx$ that are of interest for treatment effect heterogeneity, we use the full feature set in all our semi-synthetic experiments. We thus show the effectiveness of our test even with a relatively large set of features, and we expect power to improve with a smaller subset; see, e.g., the ablation studies for $\subx$ in \Cref{sec:ablation_subset}.


\paragraph{Implementation} We use the Laplacian kernel with a scale of 1.0 to compute the test statistic of $\catetest$. We perform gradient descent for 6000 epochs using the \texttt{Adam} optimizer from the JAX-based library \texttt{optax} with its default hyperparameters and record the smallest test statistic. As the function class $\GG$, we consider linear functions and two multilayer perceptrons (MLPs), one \textit{small} and one \textit{large}, with hidden layer widths of 10 and 100-50-10-5 neurons, respectively. We set the learning rate to 0.1 for the linear function and the small MLP, and to 0.01 for the large MLP. For the test $\atetest$, we use 500 bootstrap samples to estimate the variance.
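
For reference, one common parameterization of the Laplacian kernel with scale $\sigma > 0$ is
\[
  k(x, x') = \exp\!\left( - \frac{\| x - x' \|_{1}}{\sigma} \right), \qquad \sigma = 1.0 .
\]
Only the kernel family and the scale are specified above. The use of the $\ell_1$ norm and the placement of $\sigma$ in the denominator are assumptions of this sketch; other conventions use the Euclidean norm or a rate parameter $\gamma = 1/\sigma$.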


\subsection{Experimental results}
We now discuss our experimental results, depicted in \Cref{fig:hillstrom_exp}. We first conduct ablation studies for Scenario 1, in which a single subgroup has a constant bias of $\truedelta=60$. We study the effect of the biased subgroup size (\Cref{fig:abl_bias_group}) and the randomized trial sample size (\Cref{fig:abl_rct_size}) on the lower bounds $\deltalb$ obtained from our test $\catetest$ and the baseline $\atetest$. Next, we assess the validity and power of our test $\catetest$ in two more complex settings: Scenario 2 (\Cref{fig:abl_function_class}) and Scenario 3 (\Cref{fig:abl_function_class_hard}). An important practical consideration is the choice of the function class $\GG$: it should be large enough to contain $\trueg$, but an overly large class may make the optimization problem harder. Thus, we also conduct ablation studies for $\GG$. Our results show that granularity significantly improves the power of the test and, consequently, the estimated lower bound on the bias: $\catetest$ consistently outperforms the baseline across all scenarios and is robust to the choice of function class in the ablation studies, provided the class is expressive enough to capture the bias.

\begin{figure*}
\centering
    \begin{subfigure}[b]{0.245\textwidth}
        \centering
        \includegraphics[width=\textwidth]{abl_biased_group.pdf}
        \caption{Scenario 1}
      \label{fig:abl_bias_group}
    \end{subfigure}
    \hfill
    \begin{subfigure}[b]{0.245\textwidth}
        \centering
        \includegraphics[width=\textwidth]{abl_rct_size.pdf}
        \caption{Scenario 1}
        \label{fig:abl_rct_size}
    \end{subfigure}
    \hfill
    \begin{subfigure}[b]{0.245\textwidth}
        \centering
        \includegraphics[width=\textwidth]{abl_function_class_1.pdf}
        \caption{Scenario 2}
        \label{fig:abl_function_class}
    \end{subfigure}
    \hfill
    \begin{subfigure}[b]{0.245\textwidth}
        \centering
        \includegraphics[width=\textwidth]{abl_function_class_2.pdf}
        \caption{Scenario 3}
        \label{fig:abl_function_class_hard}
    \end{subfigure}\caption{
In all plots, the significance level is $\alpha=0.05$ and $\phi^\star$ denotes the oracle test, which rejects for $\delta<\truedelta$. (a--b) Scenario 1, comprising a single subgroup with a constant bias $\truedelta=60$: we plot the bias lower bound $\deltalb$ as a function of (a) the size of the biased subgroup as a percentage of the total sample size and (b) the randomized trial sample size. (c--d) Probability of rejection for different function classes $\GG$ as a function of the user-specified tolerance $\delta$ for (c) Scenario 2 (\Cref{fig:heatmap_scenario2}) based on 12 subgroups with different biases and (d) Scenario 3 (\Cref{fig:heatmap_scenario3}) based on a quadratic polynomial bias. We report the mean and standard error over 5 runs. The coefficients for the polynomial bias are fixed across runs.
}
    \label{fig:hillstrom_exp}
\end{figure*}
\paragraph{Effect of biased subgroup and randomized trial sample sizes} \Cref{fig:abl_bias_group} shows that our test yields an average lower bound $\deltalb$ that is smaller than, yet close to, the true maximum bias $\truedelta$. This indicates that the test remains valid and retains substantial power, even when the biased subgroup represents only about 14\% of the observational dataset. In contrast, $\atetest$ experiences a significant drop in power as the proportion of biased data points decreases. Such behavior is expected: $\atetest$ only tests for a difference of averages and therefore cannot detect bias in small subgroups, i.e., it is not granular. In \Cref{fig:abl_rct_size}, we add a constant bias of 60 to 44\% of the observational data points and study the effect of the randomized trial sample size. Because it relies on kernels, our test is affected more strongly than $\atetest$ by a decrease in the sample size; nevertheless, it always yields higher power, even in the very small sample size regime with 70 data points. These results show the importance of granularity: even in simple settings, $\atetest$ can fail to flag significantly biased datasets, in contrast to our method.
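
A back-of-the-envelope calculation illustrates why a test on averages loses power in this setting. Let $p$ denote the fraction of the observational data that belongs to the biased subgroup (a symbol introduced here only for illustration). With a constant bias of 60 on that subgroup and no bias elsewhere, the average bias is
\[
  \EE_{\pobs}[\truebias(X)] = 60\, p .
\]
For $p \approx 0.14$ this gives an average bias of about $60 \times 0.14 = 8.4$, and for $p = 0.44$ it gives $60 \times 0.44 = 26.4$, both well below the maximum bias of 60. One would therefore expect a test that only compares averages to detect a signal that shrinks proportionally with $p$, irrespective of how large the bias is inside the subgroup.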

\paragraph{Validity and power in complex scenarios} 
\Cref{fig:abl_function_class} and \Cref{fig:abl_function_class_hard} show the validity and power of our testing procedure for Scenario 2 (see~\Cref{fig:heatmap_scenario2}) and Scenario 3 (see~\Cref{fig:heatmap_scenario3}), respectively. In both scenarios, if we use a neural network to approximate the bias function, our test remains valid and shows very high power since it rejects the null hypothesis at values of $\delta$ close to the true bias $\truedelta$.  
\paragraph{Effect of misspecified function class} Notably, when $g$ is modeled with a linear function, our test loses its validity: it rejects values of $\delta$ that are larger than the true bias. Such behavior is expected, as the chosen function class $\GG$ lacks the complexity necessary to capture the true bias model. Nevertheless, we observe that the \textit{small} network with one hidden layer is already sufficient. Further, significantly increasing the complexity still yields high power: the \textit{large} network has approximately 45 times more parameters than the \textit{small} one. Therefore, we recommend that practitioners err on the side of a richer function class to ensure validity, even if this comes at the potential cost of some power and a more complex optimization problem. Moreover, although we cannot guarantee convergence to a global optimum given the non-convexity of the problem for complex function classes, we show in~\Cref{apx:opt} that the optimization procedure is stable and consistently reaches the same solution.
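
As a rough sanity check of the stated parameter ratio, assume (this is not specified above) that both MLPs take the 13-dimensional feature vector as input, produce a scalar output, and include a bias term in every layer. A fully connected layer from $m$ to $n$ units then has $(m+1)\,n$ parameters, which gives
\[
  \underbrace{14 \cdot 10 + 11 \cdot 1}_{\mathrm{small}} = 151,
  \qquad
  \underbrace{14 \cdot 100 + 101 \cdot 50 + 51 \cdot 10 + 11 \cdot 5 + 6 \cdot 1}_{\mathrm{large}} = 7021 .
\]
Under these assumptions the ratio is $7021 / 151 \approx 46.5$, which is in line with the figure of roughly 45 quoted above; a different input dimension or the absence of bias terms would change the exact value.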
 




        