Our approach shares limitations with other kernel-based testing methods. Most notably, the curse of dimensionality can be a significant problem given the small sample sizes typical of randomized trials. In addition, the benchmarking strategy is optimistic: outside the common support of the two studies, the bias could be arbitrarily larger than our lower bound $\deltalb$.

Our discussion suggests several important directions for future research. 
For example, our test could be adapted to settings where multiple observational datasets are available but no randomized trial is. Further, in settings where the tolerance functions $\estimandobs_\pm$ are difficult to learn, Assumption (\textit{ii}) in Theorem~\ref{thm:main} may be unrealistic. One way to overcome this limitation is to construct a doubly robust test statistic that combines multiple nuisance functions, thereby relaxing the approximation quality required of each individual nuisance function.
\subsection*{Acknowledgements}
PDB was supported by the Hasler Foundation under grant number 21050. JA was supported by the ETH AI Center. KD was supported by the ETH AI Center and the ETH Foundations of Data Science.

 %As a corollary, we also obtain theoretical guarantees for the test proposed in \citet{hussain2023falsification}. 
