%What is the problem?
 
While the increased usage of open data is a positive development, we hypothesize that it may shift which applications are targeted. For example, \cite{varoquaux2022machine} show that since the Kaggle lung cancer challenge in early 2017~\cite{kaggle2017lungCancer}, there has been a disproportionate increase in machine learning papers on lung cancer, even though many of the proposed solutions do not differ in practice. A similar concentration on fewer datasets has also been found in machine learning~\cite{koch2021reduced}. Another medical imaging example is the segmentation of cardiac ventricles, which has been addressed in multiple competitions~\cite{bernard2018deep,Suinesiaputra2012STACOM,Petitjean2015RVSC,Campello2021MMs}. The latest competition achieved highly accurate results, and commercially available software exists~\cite{Wu2023ClinicalAdoption}, yet the application remains popular for evaluating novel algorithms. Moreover, while the availability of a public dataset is a positive step towards getting a problem addressed by the community, evaluating on a single dataset also leads to overestimated performance, and thus to a performance gap when a method is applied to a different dataset~\cite{Wu2021EvaluationFDA}.

Understanding the progress being made within a field requires analysing its research. However, besides surveys focused either on methods~\cite{litjens2017survey,cheplygina2019not,budd2021survey} or on datasets within a specific application~\cite{daneshjou2021lack,wen2022characteristics}, we find few studies in the field on \emph{dataset use} beyond the datasets' initial release. We believe this is in part due to the difficulty of identifying dataset usage, as datasets may be used without being cited, and cited without being used. Our contributions, which aim to enable this identification of dataset usage, are as follows: \textbf{(1)} We present a fully automated pipeline for quantifying dataset usage based on the analysis of references and the full text of papers.
\textbf{(2)} We present an open-source annotation tool which allows for validation of the method, and can aid in tracking dataset usage in research papers. 
\textbf{(3)} We apply both tools to study the usage of several popular segmentation and classification datasets in MICCAI and MIDL conference papers between 2013 and 2023. 
\textbf{(4)} We discuss the limitations of our study and tools, report additional practices we observed during our study, and provide recommendations to ease the tracking of datasets.





