
Randomized trials have traditionally been the gold standard for informed decision-making in medicine, as they allow for unbiased estimation of treatment effects under mild assumptions. However, there is often a significant discrepancy between the patients observed in clinical practice and those enrolled in randomized trials, which limits the generalizability of the trial results~\citep{rothwell2005external, duma2018representation}. To address this issue, the U.S. Food and Drug Administration advocates for using observational data, as it is usually more representative of the patient population in clinical practice~\citep{platt2018fda, klonoff2020new}. A major caveat to this recommendation, however, is that several sources of bias, including hidden confounding, can compromise the causal conclusions drawn from observational data.

In light of the inherent limitations of both randomized and observational data, it has become a popular strategy to \emph{benchmark} observational studies against existing randomized trials to assess their quality~\citep{dahabreh2020benchmarking, forbes2020benchmarking}. The main idea behind this approach is to first emulate the procedures adopted in the randomized trial within the observational study; see, e.g.,~\citet{hernan2016using} for a detailed explanation.
Then, the treatment effect estimates from the observational data are compared with those from the randomized data. If the estimates are similar, we may be willing to trust the observational study for patient populations where the randomized data is insufficient.


To support the benchmarking framework, several works propose statistical tests that compare treatment effect estimates between randomized and observational data~\citep{viele2014use,hussain2023falsification,de2023hidden,yangelastic,demirel2024benchmarking}. In particular, two properties have been identified as essential for effective benchmarking of observational studies: \emph{tolerance} and \emph{granularity}.  
Tolerance allows the acceptance of studies whose bias is negligible and does not impact decision-making, thereby significantly reducing false rejections in real-world settings where some bias is expected. Granularity, on the other hand, allows the detection of bias in small subgroups or individuals that would otherwise go unnoticed.

However, to date, no existing statistical test satisfies both properties. Our contributions are as follows.
\begin{itemize}
	\item We design a statistical test for the null hypothesis that the treatment effects estimated from the two studies, conditioned on a set of features that define the patient subgroups, differ by at most some tolerance. To our knowledge, our test is the first to satisfy both tolerance and granularity. We then leverage both properties to estimate an asymptotically valid lower bound on the maximum bias in the observational study.
	\item We propose a novel strategy to benchmark observational studies. Specifically, we compare the lower bound on the bias against a \emph{critical value}, e.g., the minimum bias strength that would explain away the estimated treatment effect in a subgroup of interest. If the lower bound is greater than the critical value, we discard the conclusions drawn from the observational study. Finally, we demonstrate on real-world data that our strategy yields conclusions consistent with current epidemiological knowledge.
\end{itemize}
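
As an informal sketch of the two contributions above, consider the following notation, which is assumed here purely for illustration and is not fixed by the preceding text. Let $x$ denote a value of the features that define the patient subgroups, let $\tau_{\mathrm{rct}}(x)$ and $\tau_{\mathrm{os}}(x)$ denote the conditional treatment effects estimated from the randomized and the observational study, respectively, and let $\delta \geq 0$ be a user-specified tolerance. One possible way to write a null hypothesis that reflects both tolerance and granularity is
\[
	H_0(\delta) \colon \quad \bigl| \tau_{\mathrm{os}}(x) - \tau_{\mathrm{rct}}(x) \bigr| \leq \delta \quad \mbox{for all } x .
\]
Under this sketch, if $\hat{\delta}_{\mathrm{lb}}$ denotes the resulting lower bound on the maximum bias and $\delta_{\mathrm{crit}}$ denotes the critical value, the benchmarking rule discards the conclusions drawn from the observational study whenever
\[
	\hat{\delta}_{\mathrm{lb}} > \delta_{\mathrm{crit}} .
\]
The symbols $\tau_{\mathrm{rct}}$, $\tau_{\mathrm{os}}$, $\delta$, $\hat{\delta}_{\mathrm{lb}}$, and $\delta_{\mathrm{crit}}$ are hypothetical placeholders; the precise form of the hypothesis and of the bound may differ from this simplified statement.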


