%In this section, we discuss related works that combine randomized and non-randomized data either to detect flawed observational studies or to obtain more reliable treatment effect estimates.

%\paragraph{Statistical tests based on average treatment effects}
Given the challenges associated with estimating treatment effects using non-randomized data, several works propose to detect bias in the treatment effect estimated from observational data by leveraging randomized trials~\citep{viele2014use,yangelastic,morucci2023double,gao2023pretest}, multiple observational studies~\citep{karlsson2023detecting,mameche2024identifying}, or negative controls~\citep{lipsitch2010negative,donald2014testing,de2014testing,sofer2016negative}. In particular, when randomized data are available, these works introduce statistical tests for the null hypothesis
\begin{align}
\label{eq:atetest}
\hnullate: \EE_{\pxrct}\left [\estimandrct(X)\right] = \EE_{\pxrct}\left[\estimandobs (X)\right].
\end{align}
Rejecting $\hnullate$ implies that either the treatment effect estimate from the observational study is biased or the transportability assumption is violated, i.e., $\caterct(x) \neq \cateobs(x)$ for some $x \in \XX$. However, the null hypothesis in~\Cref{eq:atetest} allows neither tolerance nor granularity, and therefore suffers from two major limitations: given enough data, it rejects observational studies with negligible treatment effect bias, %This approach can be too restrictive in real-world settings, where some bias is likely present.
and it cannot detect bias in small subgroups or individuals, since such bias may be diluted or cancel out on average. %, i.e. bias may cancel out on average, leading to flawed studies being accepted. 
In the following, we present existing statistical tests designed to offer either tolerance or granularity, and describe how our method generalizes them.
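
As a stylized illustration of the second limitation (the binary feature and the bias magnitude $\delta > 0$ are hypothetical choices made only for this example), suppose that $X$ takes the values $0$ and $1$ with equal probability under $\pxrct$ and that $\estimandobs(x) = \estimandrct(x) + (2x - 1)\,\delta$. Then
\begin{align*}
\EE_{\pxrct}\left[\estimandobs(X)\right] - \EE_{\pxrct}\left[\estimandrct(X)\right] = \frac{1}{2}(-\delta) + \frac{1}{2}(+\delta) = 0,
\end{align*}
so the null in~\Cref{eq:atetest} holds even though $|\estimandobs(x) - \estimandrct(x)| = \delta$ for every $x \in \{0,1\}$.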

% However, a major limitation of testing the null hypothesis in~\Cref{eq:atetest} is that, even in infinite samples, we reject observational studies with negligible treatment effect bias. This approach can be too restrictive in real-world settings, where some bias is likely present.

 \paragraph{Statistical tests with tolerance} 
One way to make the previous statistical tests less restrictive, and to avoid rejecting studies with negligible bias, is to incorporate some tolerance. More formally, given user-specified tolerance functions $\lbobs$ and $\ubobs$, \citet{yangelastic,de2023hidden}
 propose a test for the null hypothesis
\begin{align*}
\hnull: \EE_{ \pxrct}\left[\estimandrct(X)\right] \in \left[ \EE_{ \pxrct}\left[\lbobs(X)\right], \EE_{\pxrct}\left[\ubobs(X)\right]\right],
\end{align*}
where $\lbobs(x) \leq \estimandobs(x) \leq \ubobs(x)$ for all $x \in \XX$. For instance, if we choose sensitivity analysis bounds as tolerance functions and assume transportability, we can test for the presence of unobserved confounding above a certain strength. However, current statistical tests with tolerance are not granular: because they compare averages only, large biases in small subgroups can remain undetected. In contrast, our null hypothesis in~\Cref{eq:catetolnull} allows granularity and recovers existing tests with tolerance when the subset of features is empty, i.e., $\J = \emptyset$.
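
To make the lack of granularity concrete, consider a stylized example (the constant tolerance $\varepsilon > 0$, the subgroup $S \subset \XX$, and the bias $\delta > 0$ are illustrative assumptions, not quantities taken from the cited works). Let $\lbobs(x) = \estimandobs(x) - \varepsilon$ and $\ubobs(x) = \estimandobs(x) + \varepsilon$, so that the null hypothesis above reads $\left|\EE_{\pxrct}\left[\estimandrct(X)\right] - \EE_{\pxrct}\left[\estimandobs(X)\right]\right| \leq \varepsilon$. If $\estimandrct(x) - \estimandobs(x) = \delta\,\mathbf{1}\{x \in S\}$ and the subgroup $S$ has probability $p$ under $\pxrct$, then
\begin{align*}
\EE_{\pxrct}\left[\estimandrct(X)\right] - \EE_{\pxrct}\left[\estimandobs(X)\right] = p\,\delta,
\end{align*}
and the null holds whenever $p \leq \varepsilon/\delta$. For example, with $\varepsilon = 0.1$ and $\delta = 1$, a subgroup of mass $p = 0.05$ yields an average discrepancy of $0.05 \leq 0.1$, so a bias ten times larger than the tolerance is not flagged.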

% However, a limitation of statistical tests based on the null hypotheses in~\Cref{eq:atetol,eq:atetest} is that they are not granular, i.e. they cannot detect bias on a subgroup or individual level. In particular,  bias may cancel out on average, leading to flawed studies being accepted.  
% In contrast, our null hypothesis in~\Cref{eq:catetolnull} also satisfies granularity and is more general, i.e. it recovers existing tests with tolerance when the subset of features $J$ is the empty set.

 
\paragraph{Statistical tests with granularity}
Several works address the lack of granularity. \citet{hussain2022falsification} compare group-level treatment effects on pre-specified subgroups; however, this approach requires multiple testing corrections, which reduce power as the number of subgroups grows. More recently, \citet{hussain2023falsification} propose a kernel test for the null hypothesis
\begin{align}
\label{eq:catetest}
    \hnull: \estimandrct(X) = \estimandobs(X), \quad \pxrct\text{-a.s.}
\end{align}
The main advantage of such a test is that it can detect bias in arbitrarily fine-grained subpopulations without requiring multiple testing corrections. Further, \citet{demirel2024benchmarking} extend it to account for right-censored outcomes. However, none of the existing tests with granularity incorporates tolerance functions. In contrast, our null in~\Cref{eq:catetolnull} allows tolerance %by choosing a relevant subset of features $\mathcal J$. 
and recovers $\hnull$ in~\Cref{eq:catetest} when the tolerance functions coincide ($\lbobs(x) = \ubobs(x)$ for all $x \in \XX$) and we set $\J = \{1,\ldots,d\}$.

  
\paragraph{Combining data for  estimation}
%A recent line of work proposes to combine randomized and observational data to estimate treatment effects
When both observational and randomized data are available, an alternative to testing is to estimate the bias and correct for it, yielding a more accurate treatment effect estimate~\citep{kallus2018removing,yang2020improved,wu2022integrative,yangelastic,rosenman2023combining,yuwen2023enhancing}. These approaches are promising when the two studies share the same support, as pooling the data can reduce the variance of the treatment effect estimates.
Particularly related to our work is \citet{cheng2021adaptive}, who use kernel regression to estimate the treatment effect conditional on a subset of features.
%This line of research focuses on learning and correcting the bias between observational and randomized estimates. 
However, a caveat of bias correction is that it requires the two studies to have matching supports: when the supports differ, learning the bias requires strong parametric assumptions for extrapolation. In contrast, statistical tests only aim to identify flawed observational studies; see, e.g., \citet{forbes2020benchmarking}. This task is feasible even when the supports do not match, as it suffices to detect differences on the common support of the two studies.  %For this reason, it has been widely adopted in settings where the two studies have different supports~\citep{forbes2020benchmarking}.



 


