
 In this section, we rewrite the null hypothesis from \Cref{eq:catetolnull} in terms of a \emph{signal} function that captures the bias between $\estimandobs$ and $\estimandrct$.  Then, we propose an oracle test statistic assuming that the tolerance functions $\estimandobs_\pm$ are known. Finally, we provide asymptotic guarantees for the finite-sample test statistic where the tolerance functions are estimated from the observational data.

\subsection{Null hypothesis using signal function}
We first observe that, for some tolerance functions $\estimandobs_\pm$, \Cref{eq:catetolnull} is equivalent to stating that there exists a function $g: \RR^{|\subsetx|} \to [0,1]$ such that $$\estimandobs_g(X) \defeq g \left(\subx \right)\ubobs(X) + \left(1-g \left(\subx \right)\right)\lbobs(X)$$ satisfies
$$
\pcaterct
    =   \pcateobsg, \quad \prct_{\subx}-\mathrm{a.s.}
$$
We test a slightly more restrictive hypothesis by assuming that $g$ lies in a sufficiently rich function class $\mathcal G$:
\begin{align*}
    \hnull^\GG: ~&\pcaterct
    =   \pcateobstrueg, \\ &\mathrm{for\;some}\; \trueg \in \GG,\quad \prct_{\subx}-\mathrm{a.s}.
 \end{align*}  
% where $\mathcal G$ is a class of functions from $\RR^{|J|} \to [0,1]$
% Observe that by testing $\hnull^\GG$ instead of $\hnull$ in~\Cref{eq:catetolnull}, we additionally assume that $\GG$ is a sufficiently rich function class to accurately model the dependence of the bias (captured via $\trueg$) on $\subx$.  
In practice, one can either restrict $\GG$ to a particular function class if domain knowledge is available or use neural networks as general function approximators, for which the assumption is expected to hold. 


We can then rewrite the null hypothesis above using a \emph{signal} function that captures the bias between the estimates from observational and randomized data. 
Indeed,  we have by~\Cref{asm:internalvalid}
$$
\estimandrct(x) = \EE_{\prct}\left [  Y \left(\frac{T}{\pi}-\frac{1-T}{1-\pi}\right)   \mid X = x \right ],
$$
for all $x \in \mathcal X$.
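To see why this identity holds, it may help to spell out the conditioning step. The derivation below is a sketch that assumes, in addition to~\Cref{asm:internalvalid}, that $\pi$ is the known treatment probability in the trial, so that $\prct(T=1 \mid X=x)=\pi$; the shorthand $\mu_t(x) \defeq \EE_{\prct}\left[Y \mid T=t, X=x\right]$ is introduced here only for illustration.
\begin{align*}
\EE_{\prct}\left[ \frac{YT}{\pi} \mid X=x \right] &= \frac{\prct(T=1 \mid X=x)}{\pi}\, \mu_1(x) = \mu_1(x), \\
\EE_{\prct}\left[ \frac{Y(1-T)}{1-\pi} \mid X=x \right] &= \frac{\prct(T=0 \mid X=x)}{1-\pi}\, \mu_0(x) = \mu_0(x).
\end{align*}
Subtracting the second line from the first gives $\mu_1(x)-\mu_0(x)$, which under randomization coincides with the conditional average treatment effect in the trial population.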
Further, recall that $Z=(X,Y,T)$ is the vector of observed variables, and thus
 by defining the signal function
\begin{align*}
   \psi_{g}(Z) &=  Y \left(\frac{T}{\pi}-\frac{1-T}{1-\pi}\right) - \estimandobs_g(X),
\end{align*}
we arrive at the null hypothesis  
\begin{align}
\label{eq:nullg}
    \hnull^{\mathcal G}:  ~&\EE_{\prct}\left [ \psi_{\trueg}(Z)\mid \subx \right ]  = 0,\\ &\mathrm{for\;some}\; \trueg \in \GG,\quad \prct_{\subx}-\mathrm{a.s.}\nonumber
\end{align}
At first glance, testing the null hypothesis in~\Cref{eq:nullg} may seem equivalent to testing equality of conditional means, a problem that has already been extensively studied~\citep{delgado1993testing,neumeyer2003nonparametric,racine2006testing,luedtke2019omnibus,muandet2020kernel}. However, we remark that this equivalence holds only if the function $\trueg$ is known, and to our knowledge, the more realistic scenario where $\trueg$ is unknown has not been previously explored in the literature. 

\subsection{Oracle test statistic}
\label{sec:oracletest}
We now derive a kernelized test statistic for the null hypothesis in~\Cref{eq:nullg}. First, we observe that the hypothesis $\hnull^\GG$ implies an infinite set of unconditional moment constraints, i.e. for any $g \in \GG$, it holds that \begin{align*}
 	&~~\EE_{\prct}\left [ \testrv_g(Z) \mid \subx \right ]  = 0,\quad \prct_{\subx}-\mathrm{a.s.}  \\&\implies \EE_{\prct}\left [ \testrv_g(Z) f(\subx)\right ] = 0,\quad \mathrm{for \;all\;measurable\;}   f.
 \end{align*}
Therefore, 
the validity of testing the RHS would carry over to the validity of testing $\hnull^\GG$. However, testing the RHS of the implication above for all measurable functions is infeasible. Instead,
we can restrict $f$ to be in a reproducing kernel Hilbert space (RKHS). The problem then becomes more tractable since it holds that
\begin{align}
\tstat^2(\psi_g) &\defeq \left(\sup _{\|f\|_\FF \leq 1}\EE_{\prct}\left [ \psi_g(Z) f(\subx)\right ] \right )^2 \label{eq:defH} \\&=  \left \|\EE_{\prct}\left [ \psi_g(Z) k(\subx,\cdot)\right ]  \right \|_\FF^2 \nonumber
\\&= \EE_{\prct}\left [ \psi_g(Z) k(\subx,\tilde{X}^{\J}) \psi_g(\tilde Z) \right ] \nonumber, 
\end{align}
where $k$ is a uniformly bounded reproducing kernel corresponding to an RKHS $\mathcal F$, and $\tilde Z$ is an independent copy of $Z$ following the same distribution. In particular, the null hypothesis $\hnull^\GG$ implies that $\tstat^2(\testrv_{\trueg})=0$ for some $\trueg \in \GG$, and thus we can construct a valid test based on $\tstat^2(\psi_{\trueg})$. 
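
For completeness, the two equalities in~\Cref{eq:defH} can be traced back to the reproducing property $f(x) = \langle f, k(x,\cdot)\rangle_\FF$. The following sketch assumes that the expectation and the inner product can be interchanged, which holds for instance when $k$ is bounded and $\psi_g(Z)$ is integrable; the shorthand $\mu_g$ is introduced here only for illustration.
\begin{align*}
\sup _{\|f\|_\FF \leq 1}\EE_{\prct}\left [ \psi_g(Z) f(\subx)\right ] &= \sup _{\|f\|_\FF \leq 1} \left\langle f, \mu_g \right\rangle_\FF = \left\| \mu_g \right\|_\FF, \qquad \mu_g \defeq \EE_{\prct}\left [ \psi_g(Z) k(\subx,\cdot)\right ], \\
\left\| \mu_g \right\|_\FF^2 &= \left\langle \EE_{\prct}\left [ \psi_g(Z) k(\subx,\cdot)\right ], \EE_{\prct}\left [ \psi_g(\tilde Z) k(\tilde{X}^{\J},\cdot)\right ] \right\rangle_\FF \\
&= \EE_{\prct}\left [ \psi_g(Z) k(\subx,\tilde{X}^{\J}) \psi_g(\tilde Z) \right ].
\end{align*}
By the Cauchy--Schwarz inequality, the supremum in the first line is attained at $f = \mu_g / \|\mu_g\|_\FF$ whenever $\mu_g \neq 0$.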


\paragraph{A valid test statistic} Given i.i.d. samples $Z_i$ from $\prct$, an unbiased empirical estimate of $\tstat^2(\testrv_g)$ is the cross U-statistic~\citep{kim2024dimension},  defined as  
\begin{align*}
  &\tstathat^2(\testrv_g) \defeq \frac{2}{\nrct} \sum_{i=1}^{\nrct/2} h(Z_i ; \testrv_g), \;\text{for all}\;g \in \GG, \\ & \mathrm{with}\;\;  h(Z_i; \testrv_g) \defeq \frac{2}{\nrct} \sum_{j=\nrct/2 +1}^{\nrct} \testrv_g(Z_i) k(\subx_i, \subx_j) \testrv_g(Z_j).
\end{align*}
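As a quick check of unbiasedness, assuming $\nrct$ is even: the indices $i \leq \nrct/2$ and $j > \nrct/2$ belong to disjoint halves of the sample, so $Z_i$ and $Z_j$ are independent and each of the $(\nrct/2)^2$ cross terms has expectation $\tstat^2(\testrv_g)$. Hence
\begin{align*}
\EE_{\prct}\left[ \tstathat^2(\testrv_g) \right] &= \frac{4}{\nrct^2} \sum_{i=1}^{\nrct/2} \sum_{j=\nrct/2+1}^{\nrct} \EE_{\prct}\left[ \testrv_g(Z_i) k(\subx_i, \subx_j) \testrv_g(Z_j) \right] \\
&= \frac{4}{\nrct^2} \cdot \frac{\nrct^2}{4}\, \tstat^2(\testrv_g) = \tstat^2(\testrv_g).
\end{align*}
For instance, with $\nrct = 4$ the statistic averages the four cross terms indexed by $(i,j) \in \{1,2\} \times \{3,4\}$ and never pairs an observation with itself or with another observation from the same half.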
 The main advantage of the cross U-statistic is that, for $g=\trueg$, it is asymptotically normal under the null hypothesis $\hnull^\GG$ and weak regularity assumptions~(see Theorem~\ref{thm:main}), i.e. as $\nrct \to \infty$ it holds that
 \begin{equation*}
     \sqrt{\frac{\nrct}{2}} \;\frac{\tstathat^2(\testrv_{\trueg})}{\hat \sigma\left(\tstathat^2(\testrv_{\trueg})\right)} \to \gauss \left(0,1\right), 
 \end{equation*}
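  where the convergence is in distribution and $\hat \sigma \left( \tstathat^2(\testrv_{\trueg})\right)$ is a finite-sample estimate of $\sigma \left( \tstathat^2(\testrv_{\trueg})\right)$, the square root of the variance term defined as  \begin{equation*}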
\sigma^2\left(\tstathat^2(\testrv_{g})\right) \defeq \EE_{\prct}\left [  \left(h(Z; \testrv_{g}) -\EE_{\prct} \left[ h(Z; \testrv_{g})\right] \right)^2 \right ],
\end{equation*}
for all $g \in \GG.$
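The estimator $\hat \sigma$ is not written out explicitly in this section. One natural choice, in the spirit of~\citet{kim2024dimension}, is the sample standard deviation of the values $h(Z_i; \testrv_g)$ over the first half of the sample. The display below is a sketch under this assumption rather than a definitive specification; it uses the fact that the average of $h(Z_i; \testrv_g)$ over $i \leq \nrct/2$ equals $\tstathat^2(\testrv_g)$.
\begin{equation*}
\hat \sigma^2\left(\tstathat^2(\testrv_{g})\right) = \frac{2}{\nrct} \sum_{i=1}^{\nrct/2} \left( h(Z_i; \testrv_g) - \tstathat^2(\testrv_g) \right)^2.
\end{equation*}
Replacing the normalization $2/\nrct$ by $1/(\nrct/2 - 1)$ gives the usual bias-corrected variant, which is asymptotically equivalent.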
Further, observe that under the assumption that $\trueg \in \mathcal G$, we have
\begin{align}
\label{eq:objective}
\tstatopt &\defeq \underset{\g \in \GG}{\min}\left\vert \sqrt{\frac{\nrct}{2}} \frac{\tstathat^2(\psi_g)}{\hat \sigma\left(\tstathat^2(\psi_g)\right)}\right\vert\\& \leq \left\vert  \sqrt{\frac{\nrct}{2}} \frac{\tstathat^2(\psi_{\trueg})}{\hat \sigma\left(\tstathat^2(\psi_{\trueg})\right)}\right\vert.\nonumber
\end{align}
Therefore,  we can achieve validity (but possibly suffer in power) by comparing the test statistic $\tstatopt$ with the quantiles of the half-normal distribution.
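
To make the rejection threshold concrete, write $z_{1-\alpha}$ for the $(1-\alpha)$-quantile of the half-normal distribution and $\Phi$ for the standard normal distribution function (notation introduced here for illustration). If $N \sim \gauss(0,1)$, then $|N|$ is half-normal with $\mathbb P(|N| \leq z) = 2\Phi(z) - 1$ for $z \geq 0$, so that
\begin{equation*}
z_{1-\alpha} = \Phi^{-1}\left(1 - \frac{\alpha}{2}\right),
\end{equation*}
which gives $z_{0.95} \approx 1.96$ for $\alpha = 0.05$. Combining this with~\Cref{eq:objective} and the asymptotic normality stated above, we obtain under $\hnull^\GG$
\begin{equation*}
\mathbb P\left( \tstatopt \geq z_{1-\alpha} \right) \leq \mathbb P\left( \left\vert \sqrt{\frac{\nrct}{2}} \frac{\tstathat^2(\psi_{\trueg})}{\hat \sigma\left(\tstathat^2(\psi_{\trueg})\right)} \right\vert \geq z_{1-\alpha} \right) \to \alpha, \quad \text{as}\;\; \nrct \to \infty.
\end{equation*}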
 











\paragraph{Why not a classic U-statistic?}
Note that it is not clear how to test the null hypothesis $\hnull^\GG$ using a classic U-statistic~\citep{serfling2009approximation}, as done in previous works (see e.g.~\citet{hussain2023falsification,demirel2024benchmarking}). The main challenge is that under the null hypothesis $\tstat^2(\psi_{\trueg})=0$, the U-statistic is degenerate and, after scaling by $\nrct$, converges in distribution to a weighted sum of $\chi^2$ random variables. However, estimating the quantiles of this asymptotic distribution (needed for a valid test) via bootstrapping requires knowing the function $\trueg$~\citep{huskova1993consistency}. 
In contrast, our test statistic $\tstatopt$ is bounded by a valid asymptotic pivot, i.e. a function of the data and the unknown function $\trueg$ whose asymptotic distribution does not depend on $\trueg$. Hence, we can compute the quantiles of the RHS in~\Cref{eq:objective} and construct an asymptotically valid test.









\subsection{Theoretical guarantees}
\label{subsec:guarantees}
Since, in practice, we do not have access to the signal function $\testrv_g$, we define its finite-sample analogue as
\begin{equation*}
\hat \testrv_{g}(Z) =  Y \left(\frac{T}{\pi}-\frac{1-T}{1-\pi}\right) - \hatestimandobs_g(X), 
\end{equation*}
where $\hatestimandobs_g(X) \defeq  g(\subx )\hatubobs(X) + \left(1-g\left(\subx\right)\right)\hatlbobs(X)$,
and $\hatestimandobs_{\pm}$ are consistent estimates of $\estimandobs_{\pm}$ that use only the observational data $\dataobs$. We can then define our finite-sample test statistic as
\begin{equation*}
	\tstathatopt\defeq \underset{\g \in \GG}{\min}\left\vert \sqrt{\frac{\nrct}{2}} \frac{\tstathat^2(\hat \psi_g)}{\hat \sigma\left(\tstathat^2(\hat \psi_g)\right)}\right\vert,\end{equation*} 
	and the corresponding testing function $ \test(\alpha) \defeq \indi \left\{ \tstathatopt\geq z_{1-\alpha}\right\}$, where $z_{1-\alpha}$ is the $(1-\alpha)$-quantile of the half-normal distribution.  Below, we provide sufficient conditions for $\test$ to be an asymptotically valid test.  
\begin{thm}[Validity of the test]
\label{thm:main} 
Assume that:
\begin{enumerate}
\item[(i)] $    \mathbb E_{\prct} \left [ \testrv^2_{\trueg}(Z)~k^2(\subx, \tilde{X}^\J )  ~\testrv^2_{\trueg}(\tilde Z) \right ]>0
$.
%where $\tilde Z$ is an  independent copy of $Z$ that follows the same distribution. 
\item[(ii)]
The estimates $\hatestimandobs_{\pm}$  satisfy $$  \| \estimandobs_{\pm} -  \hatestimandobs_{\pm}\|_{L^2(\prct)}  = O_{\pobs}\left(\frac{1}{\sqrt{\nobs}}\right),$$ and it holds that 
$  \underset{\nrct,\nobs \to \infty}{\lim} \nrct/\nobs=  0.
$
\end{enumerate}\vspace{-1.5mm}
Then, we have that $$\sqrt{\frac{\nrct}{2}} \frac{\tstathat^2(\hat \psi_{\trueg})}{\hat \sigma\left(\tstathat^2(\hat \psi_{\trueg})\right)}\to \gauss(0,1), \;\;\text{as}\;\; \nrct,\nobs \to \infty.$$ 
Hence, 
$\test( \alpha)$ is  a valid asymptotic test at level $\alpha$  for the null hypothesis  $\hnull^\GG$ from Equation~\eqref{eq:nullg}.
	\end{thm} 
We refer the reader to~\Cref{apx:proofthm} for a complete proof.
\paragraph{Discussion of assumptions} Assumption~(\textit{i}) is mild and applies to very general settings, e.g. it is satisfied when $Y$ is a non-deterministic random variable. Assumption (\textit{ii}) is stronger and generally only expected to hold when $\nobs\gg\nrct$ and the support of the randomized controlled trial is contained in the support of the observational study, i.e.  
$\supp(\prct_X) \subseteq \supp(\pobs_X).$
These two conditions are realistic in our setting, as they align with the standard design of observational studies~\citep{franklin2019evaluating, schurman2019framework, he2020clinical}. Further, we remark that previous works either assume oracle access to the functions $\estimandobs_{\pm}$~\citep{hussain2023falsification,demirel2024benchmarking} or impose similar assumptions on the rates~\citep{de2023hidden}.
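
The role of the rate condition in Assumption~(\textit{ii}) can be seen from a heuristic calculation, given here as an informal sketch rather than a substitute for the proof in~\Cref{apx:proofthm}. It assumes that $\ubobs$ and $\lbobs$ denote the tolerance functions $\estimandobs_{+}$ and $\estimandobs_{-}$, with $\hatubobs$ and $\hatlbobs$ their estimates. Since $g$ takes values in $[0,1]$, the triangle inequality gives
\begin{align*}
\left\vert \hat \testrv_{g}(Z) - \testrv_{g}(Z) \right\vert &= \left\vert \hatestimandobs_g(X) - \estimandobs_g(X) \right\vert \\
&\leq g(\subx) \left\vert \hatubobs(X) - \ubobs(X) \right\vert + \left(1 - g(\subx)\right) \left\vert \hatlbobs(X) - \lbobs(X) \right\vert,
\end{align*}
so that $\| \hat \testrv_{g} - \testrv_{g} \|_{L^2(\prct)} \leq \| \hatubobs - \ubobs \|_{L^2(\prct)} + \| \hatlbobs - \lbobs \|_{L^2(\prct)} = O_{\pobs}\left(\nobs^{-1/2}\right)$ uniformly over $g$. After the $\sqrt{\nrct}$ scaling used in the studentized statistic, the perturbation caused by estimating the tolerance functions is therefore of order
\begin{equation*}
\sqrt{\nrct} \cdot O_{\pobs}\left( \frac{1}{\sqrt{\nobs}} \right) = O_{\pobs}\left( \sqrt{\frac{\nrct}{\nobs}} \right) = o_{\pobs}(1),
\end{equation*}
which vanishes precisely because $\nrct/\nobs \to 0$.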

\paragraph{Power of the test}
While Theorem~\ref{thm:main} only shows asymptotic validity, we further present guarantees for the asymptotic power of the test in~\Cref{sec:power}. 
In particular, in Theorem~\ref{thm:power}, we show that under the alternative hypothesis 
\begin{equation*}
    H_A^{\GG }: \inf_{g \in \GG} \sup _{\|f\|_\FF \leq 1}\EE_{\prct} \left [ \psi_{g}(Z) f(\subx)\right ]  >0,
\end{equation*}
the test statistic $\tstathatopt$, the finite-sample counterpart of $\tstatopt$ in~\Cref{eq:objective}, grows at the typical rate of order $\sqrt{\nrct}$ for a fixed function class $\GG$.  Thus, it yields the same asymptotic power as the existing kernel tests~(see e.g. \citet{muandet2020kernel,hussain2023falsification}).











% \Cref{thm:main} states that our test is asymptotically valid under two main caveats. First, we assume oracle access to $\lbobs$ and $\ubobs$.  However, a practitioner would run the test using an estimate of the bounds and any underlying nuisance function, such as the propensity score and outcome regression. Second, we assume we can find a global minimizer $\hatg$ of the test statistic. This is not trivial in practice as the objective can be non-convex. We show in our experiments that the validity still holds in practice, despite these two limitations. 


\subsection{A strategy for benchmarking the observational study} Given the theoretical results in this section, we can now introduce our strategy to benchmark observational studies. To do so, we first leverage both tolerance and granularity to estimate an asymptotically valid lower bound on the maximum bias for any subgroup in the observational study, that is $\proxybias \defeq \| \estimandobs -\estimandrct \|_{L^\infty(\prct)}$.


More concretely, we choose as tolerance functions $\estimandobs_{\pm}(X) = \estimandobs(X) \pm \delta$, for some constant $\delta \in \RR^+$, and  we define a data-dependent lower bound on the bias as\begin{align}
\label{eq:deltalb}
\deltalb \defeq \inf_{\delta}\{ \delta : \test(\alpha) = 0\},
\end{align}
where $\test$ depends implicitly on $\delta$ via the tolerance functions and we fix $\J= \{1,\ldots,d\}$. 
Then, under the assumptions in~\Cref{thm:main}, it holds that
$$
\mathbb P \left ( \proxybias \geq \deltalb \right) \geq 1 - \alpha + o(1).
$$
Crucially, to benchmark the observational study, we propose to compare the lower bound on the bias against a critical value, e.g. the minimum bias strength that would explain away the estimated treatment effect in a subgroup of interest. If the lower bound is greater than the critical value, we discard the conclusions drawn from the observational study. In~\Cref{sec:rwexp}, we will show that our strategy yields conclusions consistent with current epidemiological knowledge using real-world data from the Women's Health Initiative.
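
The coverage statement above can be motivated as follows. This is an informal sketch that assumes $\proxybias > 0$ and that the function $g_B$ defined below (a name introduced here only for illustration) belongs to $\GG$. Take $\delta = \proxybias$, so that $\estimandobs(x) - \proxybias \leq \estimandrct(x) \leq \estimandobs(x) + \proxybias$ for $\prct$-almost every $x$, and define
\begin{equation*}
g_B(x) \defeq \frac{\estimandrct(x) - \estimandobs(x) + \proxybias}{2\,\proxybias} \in [0,1].
\end{equation*}
Plugging $g_B$ into the definition of $\estimandobs_g$ with $\estimandobs_{\pm} = \estimandobs \pm \proxybias$ gives
\begin{align*}
\estimandobs_{g_B}(x) &= g_B(x)\left(\estimandobs(x) + \proxybias\right) + \left(1 - g_B(x)\right)\left(\estimandobs(x) - \proxybias\right) \\
&= \estimandobs(x) + \left(2 g_B(x) - 1\right)\proxybias = \estimandrct(x).
\end{align*}
Hence $\hnull^\GG$ holds at $\delta = \proxybias$, and by~\Cref{thm:main} the test returns $0$ at this value of $\delta$ with probability at least $1-\alpha$ asymptotically. On that event, $\proxybias$ belongs to the set in~\Cref{eq:deltalb}, and therefore $\deltalb \leq \proxybias$.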



\paragraph{Limitations} %A similar quantity was already proposed in~\citep{de2023hidden} to lower bound the unobserved confounding strength. %In contrast, we focus on a lower bound for the bias and show in the following sections how it can be used for benchmarking the observational study. 
We remark here that the lower bound defined in~\Cref{eq:deltalb} is optimistic, as there are two potential sources of looseness. First, a lack of power in our testing procedure can result in a lower bound far from $\proxybias$. However, at least in principle, this gap can be reduced by future work on more powerful tests. Second, the bias could be arbitrarily high outside the support of the randomized trial, that is $\| \estimandobs -\cateobs \|_{L^\infty(\pobs)} > \|\estimandobs -\cateobs \|_{L^\infty(\prct)} =   \proxybias$. Unfortunately, we cannot reduce this gap without making further assumptions on the bias structure that would allow us to extrapolate beyond the randomized trial support.


 



















