

We have access to two datasets: $\datarct$ of size $\nrct$ from a randomized trial~($\rct$) and $\dataobs$ of size $\nobs$ from an observational study~($\obs$), each containing tuples $Z := (X,Y,T)$ of covariates $X \in \RR^\xdim$, a bounded observed outcome $Y\in \RR$, and a treatment assignment variable $T \in \{0,1\}$. We assume that the data are drawn i.i.d.~from the distributions $\prct$ and $\pobs$
that are marginal distributions of the respective 
full distribution $\pfull^\diamond$
over $\left(X, U, Y(0), Y(1),Y, T\right)$ for $\diamond \in \{\rct, \obs\}$. In particular, the full distribution also includes randomness over a vector of unobserved covariates $U\in \RR^\udim$ and potential outcomes $\left(Y(0), Y(1)\right) \in \RR^2$. We further assume that the support of the randomized trial is included in the support of the observational study, i.e.,
$\supp(\prct_X) \subseteq \supp(\pobs_X)$, where we use the shorthand $\mathbb P_X$ to denote the marginal distribution of $X$ under $\mathbb P$. 

\paragraph{Treatment effect estimation}
\label{sec:hte}
A crucial quantity to estimate for decision-making in many domains is the conditional average treatment effect~(CATE). The CATE  is a function   $\mu^\diamond: \XX \to \RR$ for $\diamond \in \{\rct,\obs\}$ and $\XX \subseteq \supp\left( \pxrct \right)$,   defined by
\begin{align*}
\yone^\diamond (x) \defeq \EE_{\pfull^{\diamond}}\left [  Y(1) - Y(0) \mid X=x \right ].
\end{align*}
Unfortunately, we cannot estimate the CATE directly from the observed data, as we never observe both potential outcomes for the same unit. Instead, we can estimate the difference of regression functions $\estimand^\diamond:\XX \to \RR$ for $\diamond \in \{\rct,\obs\}$, defined by
\begin{align*}
 \estimand^\diamond(x) \defeq \EE_{\PP^\diamond}\left [ Y \mid T=1, X=x\right ]  -\EE_{\PP^\diamond}\left [ Y \mid T=0, X=x\right ].
\end{align*}
For the treatment effect in the randomized trial, we can then observe that  $\estimandrct(x) = \caterct(x)$ holds for all $x \in \XX$, under the assumption of internal validity outlined below.
\begin{assumption}[Internal validity]
\label{asm:internalvalid}
The data-generating process of the randomized trial satisfies
 \begin{align*}
(i)&\;\;   Y = Y(T)\;\; \pfullrct -\mathrm{almost\;surely}. \\
(ii)&\;\; T \ind (Y(1),Y(0)).\\
(iii)& \;\; \pfullrct(T =1 \mid X,U)= \pi\in(0,1).
 \end{align*}
\end{assumption} 
In particular, \Cref{asm:internalvalid}
is expected to hold by design in a completely randomized experiment, and thus, $\caterct$ can be estimated from the observed data under mild assumptions~\citep{rubin1978bayesian}. On the other hand, we cannot estimate $\cateobs$ from the observed data due to hidden confounding or other sources of bias in the observational study, i.e. we cannot rule out the existence of $x \in \XX$ such that $\estimandobs (x)\neq \cateobs (x)$. Therefore, it is crucial to benchmark the observational study before using the estimate of $\estimandobs$ for any downstream task.
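
As a brief sketch of why internal validity yields $\estimandrct = \caterct$, the following derivation may help. It assumes, beyond the literal statement of \Cref{asm:internalvalid}, that randomization also holds conditionally on the covariates, i.e. $T \ind (Y(1),Y(0)) \mid X$ under $\pfullrct$, which we take to be implied by a completely randomized design. For $t \in \{0,1\}$ and $x \in \XX$,
\begin{align*}
\EE_{\pfullrct}\left[ Y \mid T=t, X=x \right]
&= \EE_{\pfullrct}\left[ Y(t) \mid T=t, X=x \right] && \text{by (i)} \\
&= \EE_{\pfullrct}\left[ Y(t) \mid X=x \right] && \text{by randomization},
\end{align*}
where (iii) guarantees that both conditioning events $\{T=0\}$ and $\{T=1\}$ have positive probability. Subtracting the case $t=0$ from the case $t=1$ gives $\estimandrct(x) = \caterct(x)$.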


\begin{figure*}
 \centering 
\includegraphics[scale=0.29]{figure1} 
    \caption{High-level illustration of our approach. We want to test if the bias in the observational study, i.e. $\estimandobs - \cateobs$, is contained within a tolerance range. However, the true treatment effect $\cateobs$ is not identifiable, so we instead test the bias between the treatment effects estimated from the two studies, i.e. $ \estimandobs -\estimandrct$.} %which, under internal validity and transportability, is equivalent to the bias in the observational study.}
    \label{fig:setting} 
\end{figure*}


\subsection{Null hypothesis}
Our goal is to test if the bias in the observational study, defined as  $ \truedelta(x) \defeq \estimandobs(x) -\cateobs(x)$ for all $x \in \XX$,
is contained within a tolerance range. However, the bias $\truedelta$ cannot be estimated from the data, since $\cateobs$ is not identifiable. Instead, we can test the bias $ \proxybias(x) \defeq \estimandobs(x) - \estimandrct(x)$, which coincides with $\truedelta$ under internal validity (\Cref{asm:internalvalid}) and transportability, where the latter means that $\cateobs(x)=\caterct(x)$ for all $x \in \XX$~(see \Cref{fig:setting}).
 In particular, we would like to test if the bias $\proxybias$ between the two studies is contained within a tolerance range across all patient subgroups. We refer to the first requirement as \emph{tolerance} and to the second as \emph{granularity}. Hence, we will now introduce a null hypothesis that allows for both tolerance and granularity.
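
To make the link between the two notions of bias explicit, the following decomposition may be useful. It is an algebraic identity that holds without any assumption and uses only quantities defined above:
\begin{align*}
\proxybias(x) = \truedelta(x) + \left( \cateobs(x) - \caterct(x) \right) + \left( \caterct(x) - \estimandrct(x) \right).
\end{align*}
The second term vanishes under transportability and the third under \Cref{asm:internalvalid}, in which case $\proxybias(x) = \truedelta(x)$ for all $x \in \XX$. If either condition fails, $\proxybias$ also absorbs the corresponding discrepancy. With this link in place, the remaining task is to formalize the test of $\proxybias$ through a null hypothesis.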
 
 
To do so, we define two bounded tolerance functions $\estimandobs_{\pm}: \XX \to \RR$ that capture how much the estimated treatment effects can differ between studies and satisfy $ \estimandobs_-(x) \leq \estimandobs(x) \leq  \estimandobs_+(x) $ for all $x \in \XX$.
Further, we define the patient subgroups via a subset of features $\subx$, corresponding to the covariates with indices $\mathcal J \subseteq \{1, \ldots, d\}$. We can then introduce our null hypothesis, given by
\begin{align}
\label{eq:catetolnull}
 &\quad \quad~ \hnull: \;\; \pcaterct \in \\
    &\left[\pcateobslb, \pcateobsub \right], \;\;\prct_{\subx}-\mathrm{a.s.} \nonumber \end{align}
   \paragraph{Discussion of our null hypothesis} We provide several remarks on  the null hypothesis in~\Cref{eq:catetolnull}. 
First, we satisfy tolerance by testing if $\estimandrct(x)$ is contained in an interval around $\estimandobs(x)$ for $\pxrct$-almost all $x \in \XX$. Second, we can satisfy granularity by choosing an appropriate subset $\mathcal J$: when $|\mathcal J| = d$, we detect bias at the individual level, thereby satisfying the strictest definition of granularity. On the other hand, when $|\mathcal J| = 0$, we only test if the average treatment effects agree up to tolerance, thus potentially ignoring bias in small subgroups and individuals. Third, we test if the treatment effects are equal (up to tolerance) only on the support of the randomized trial, since we cannot extrapolate outside the support of $\pxrct$ without further assumptions.
%Below, we discuss two concrete examples of tolerance functions. 
\paragraph{Example 1: User-specified tolerance}
\label{sec:transportability}
A natural choice for the tolerance functions is to add a user-specified function $\delta(x) \geq 0$ to (respectively subtract it from) $\estimandobs$, that is,
\begin{align*}
\estimandobs_{\pm}(x) = \estimandobs (x) \pm \delta(x), \quad \mathrm{for\;all}\; x \in \XX.
\end{align*}
The function $\delta$ can incorporate all sources of bias in the observational study, such as unobserved confounding and non-adherence to treatment assignments. For instance, we can test whether the maximum bias $\|\proxybias\|_{L^\infty(\pxrct)}$ is larger than a critical value $\deltactoracle \in \RR$ by choosing as tolerance $\estimandobs_{\pm}(x) = \estimandobs (x) \pm \deltactoracle$. 
Further, similar tolerance functions have been previously used in the context of modeling violations of the transportability assumption, see e.g. \citet{nguyen2017sensitivity,nguyen2018sensitivity, dahabreh2022global,dahabreh2023sensitivity}. 
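
As a short worked step for the constant-tolerance case, the display below spells out why containment in the band $\estimandobs \pm \deltactoracle$ is a statement about the maximum bias. It is a sketch for individual-level containment only (the case $|\mathcal J| = d$); coarser subgroups are not covered by it.
\begin{align*}
& \estimandobs(x) - \deltactoracle \leq \estimandrct(x) \leq \estimandobs(x) + \deltactoracle, \;\; \pxrct-\mathrm{a.s.} \\
\Longleftrightarrow\;\; & |\proxybias(x)| \leq \deltactoracle, \;\; \pxrct-\mathrm{a.s.} \\
\Longleftrightarrow\;\; & \|\proxybias\|_{L^\infty(\pxrct)} \leq \deltactoracle .
\end{align*}
Hence, a violation of the containment on a set of positive $\pxrct$-probability is the same event as the maximum bias exceeding $\deltactoracle$.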




\paragraph{Example 2: Sensitivity analysis bounds}
Another practical choice for the tolerance functions $\estimandobs_{\pm}$ is to use the upper and lower bounds arising from a sensitivity analysis model. For instance, the marginal sensitivity model~\citep{tan2006distributional} is commonly used to account for unobserved confounding in observational data. In particular, this model assumes that the influence of $U$ on $T$ is limited by a \emph{confounding strength} $\confvalue \geq 1$:
\begin{align*}
\frac{1}{\confvalue} \leq \frac{\pfullobs(T=1 \mid X,U)}{\pfullobs(T=0 \mid X,U)} \Big/ \frac{\pobs(T=1 \mid X)}{\pobs(T=0 \mid X)} \leq \confvalue,~ \pfullobs-\mathrm{a.s.}
\end{align*}
We can thus define $\estimandobs_{+}$ and $\estimandobs_{-}$ as the upper and lower bounds for $\cateobs$ under the assumption of $\confvalue$-bounded confounding strength. Our test can then be used to detect if the marginal sensitivity model is well-specified; see e.g.~\citet{de2023hidden} for a more detailed explanation of this setting.
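
To give a feel for what the confounding strength controls, the marginal sensitivity model can be rewritten as a range for the full propensity score. The shorthands $e(x) \defeq \pobs(T=1 \mid X=x)$ and $e(x,u) \defeq \pfullobs(T=1 \mid X=x, U=u)$ are introduced here for illustration only and are not part of the notation used elsewhere; we also assume $0 < e(x) < 1$ so that the odds are well defined. Solving the odds-ratio bounds for $e(x,u)$ gives
\begin{align*}
\frac{e(x)}{e(x) + \confvalue \left(1-e(x)\right)} \;\leq\; e(x,u) \;\leq\; \frac{\confvalue\, e(x)}{\confvalue\, e(x) + 1 - e(x)}.
\end{align*}
For example, with $e(x) = 1/2$ and $\confvalue = 2$, the full propensity score lies in $[1/3, 2/3]$, whereas $\confvalue = 1$ forces $e(x,u) = e(x)$, i.e. the unobserved covariates do not affect the treatment assignment once $X$ is given.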







 
 







