

In this section, we provide a concrete application of the benchmarking framework using the Women's Health Initiative (WHI) study. We illustrate the strengths of our testing procedure and explain why tolerance and granularity are necessary for effective benchmarking.

\subsection{The WHI controversy}
The WHI study included a randomized trial and an observational study that investigated the use of hormone therapy~(HT) for preventing common sources of mortality among postmenopausal women, including cardiovascular disease, cancer, and fractures~\citep{anderson2003implementation}.

\paragraph{To HT, or not to HT} The initial results of the WHI study in 2002 led to fear and confusion regarding the use of HT after menopause, resulting in a dramatic reduction in HT prescriptions around the world. Although it was stated in 2002 that HT increases the risk of coronary heart disease (CHD) for all women, subsequent studies clearly showed that younger women close to menopause can benefit from HT.
Indeed, for at least two decades before the WHI study, observational studies had suggested that HT reduces the risk of CHD~\citep{stampfer1991estrogen,henderson1991decreased,grady1992hormone,grodstein2000prospective}. Further, subsequent randomized trials have continued to demonstrate the benefits of HT when started early in young women close to menopause~\citep{hodis2016vascular,taylor2017effects}. To date, the consensus among epidemiologists is that HT reduces the risk of CHD in women younger than 60 years and within 10 years of menopause; see, e.g., the current guidelines for menopausal hormone therapy~\citep{lee20202020}.



  
\paragraph{Limitations of the WHI randomized trial} The main issue with the randomized trial from the WHI study is that cardiac events in younger women are relatively rare. Not only would it have been prohibitively expensive to conduct a randomized trial exclusively in younger women, but it would also have taken many years to accumulate enough events to reach statistical significance. As a result, the trial lacked enough events to reach statistical significance on the subgroup of interest. At the same time, the average treatment effect (over all the patients in the trial) suggested an increase in CHD risk because the majority of cardiac events came from older women, which led epidemiologists to conclude that HT is harmful to all women.
 
\paragraph{Benchmarking can help!} It has been argued that in the 10 years since the WHI study, many women have been denied HT, significantly disadvantaging a generation of women~\citep{sturdee2011updated}. The natural question is thus whether, going back in time, benchmarking the observational study could have prevented such a turn of events.
This is a well-suited setting to test our methodology, as we would like to ask the question:
 \begin{align*} & \emph{Is the bias in the observational study enough to explain away } \\&\emph{\;\;\;the benefits of HT in young women close to menopause?}
 \end{align*}
 In what follows, we show that answering such a question requires a statistical test that offers tolerance.  
 Further, even though we cannot demonstrate that granularity is necessary in this concrete example\footnote{To do so, we would need to know a small biased subgroup in the observational study and show that only the tests with granularity detect the bias. Unfortunately, we are unaware of subgroups that were found to be biased in the WHI study.}, we stress that it is equally important in practice. This is especially true with respect to age and time since the start of menopause, as the tests without granularity can fail to detect subgroup bias that cancels on average, as shown in our semi-synthetic experiments. 
 

\begin{table}
\centering

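\caption{Results of the statistical tests on the WHI study at significance level $\alpha=0.05$. $\deltact$ is the amount of bias that would explain away the positive effect of HT in young women close to menopause. $\deltalb$ is the lower bound on the maximum bias detected in the observational study; \xmark~indicates that it cannot be computed because the test allows no tolerance. $\atetestzero$ and $\catetestzero$ denote the respective tests without tolerance, i.e. when the tolerance is set to $\delta=0$. In the last row, $1$ means that the test rejects the study and $0$ that it does not.}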
\label{table:whi}
\begin{tabular}{ccccc}
\toprule
Statistical tests & $\catetest$ & $\atetest$ & $\catetestzero$ & $\atetestzero$ \\
\midrule
$\deltact$ & $0.32$  & $0.32$ &  $0.32$ & $0.32$ \\
$\deltalb$  & $\mathbf{0.25}$  &    $0.11$  &\xmark & \xmark\\ 
\midrule
Reject the study  & $\color{mygreen} \mathbf 0$   & $\color{mygreen} \mathbf 0$ & $\color{pierLink} \mathbf 1 $ &  $\color{pierLink} \mathbf 1 $ \\
\bottomrule
\end{tabular}
\end{table}


\subsection{Experimental results}
Returning to our question of interest, we demonstrate how our method can provide a correct answer, i.e. one that aligns with the epidemiology literature. A natural way to do so is to first estimate, from the available data, the amount of bias that would explain away the treatment effect on the group of interest. We refer to this quantity as the critical value and define it as
$$
\deltact \defeq \Big|\EE_{\pobs} \left [  \estimandobs(X) \mid X \in G \right ] \Big| .
$$
In essence, the critical value $\deltact$ quantifies the minimum strength of bias for which both positive and negative values of the treatment effect are plausible, thereby invalidating the observational study results\footnote{Note that other choices for the critical value are possible, and practitioners should determine the most appropriate one given the specific context.}. In our example, the group $G$ is defined as young women (age $\leq 60$) who are close to menopause ($\leq 10$ years since its onset).
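
To make the definition concrete, one simple way to estimate the critical value is a plug-in average over the observational samples that fall in $G$. The following is only an illustrative sketch under stated assumptions: the estimate $\widehat{\tau}$ of $\estimandobs$, the samples $X_1,\dots,X_n$ drawn from $\pobs$, the index set $I_G$, and the symbol $\widehat{\delta}$ are notation introduced here for illustration, and this estimator need not coincide with the one used in our experiments.
$$
\widehat{\delta} = \left| \frac{1}{|I_G|} \sum_{i \in I_G} \widehat{\tau}(X_i) \right|, \qquad I_G = \big\{ i \in \{1,\dots,n\} : X_i \in G \big\}.
$$
Under this sketch, $\widehat{\delta}$ is simply the absolute value of the average estimated treatment effect among the observational patients in the group of interest.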


As in the semi-synthetic experiments, we instantiate the tolerance functions using constant upper and lower bounds, i.e. $\estimandobs_{\pm}(X) = \estimandobs(X) \pm \delta$ for some constant $\delta \in \RR^+$. We compute the lower bound $\deltalb$ on the maximum amount of treatment effect bias in the observational study, as defined in~\Cref{eq:deltalb}.
We remark that this quantity can be computed only for tests that allow some tolerance. Our decision-making procedure then flags the observational study as invalid if $\deltalb \geq \deltact$.
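
As a concrete illustration of this rule, using only the values reported in~\Cref{table:whi}, the comparison for the two tests with tolerance reads
\begin{align*}
\catetest:&\quad \deltalb = 0.25 < 0.32 = \deltact, \\
\atetest:&\quad \deltalb = 0.11 < 0.32 = \deltact.
\end{align*}
In both cases the detected bias falls short of the critical value, so the study is not flagged. Had either lower bound reached $0.32$, the rule would instead have flagged the study as invalid.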




\paragraph{Experimental details} We consider a binary-valued outcome: the presence of CHD within the follow-up period.
As covariates $X$, we choose the basic adjustment variables used in many existing analyses. We further restrict the cohort to patients who were not current users of HT at the time of enrolment, as the duration of HT use has been found to have a substantial impact on treatment effects~\citep{prentice2005combined,vandenbroucke2009hrt}. We refer to~\Cref{apx:whi_exp} for complete experimental details.

We now present evidence that our procedure can yield the conclusions established in the epidemiological literature. It avoids issuing false alarms when the bias is negligible (tolerance), and it detects a larger amount of bias because it is more powerful than tests based on the average effect (granularity).







\paragraph{Results} In~\Cref{table:whi}, we report the results of all the statistical tests on the WHI study. First, we observe that both tests that allow for tolerance correctly refrain from flagging the study, while $\catetestzero$ and $\atetestzero$ reject it. This difference shows the importance of tolerance for distinguishing between small and large amounts of bias. Second, we observe that the lower bound on the bias is larger for the test with granularity, $\catetest$ ($0.25$), than for the test without granularity, $\atetest$ ($0.11$). Such behavior is expected and shows the importance of granularity for detecting bias that would otherwise go unnoticed.







