%Summary of findings
We presented two open-source tools for detecting dataset usage in scientific papers and applied them to a case study on publicly available medical datasets. We showed that papers in major medical conferences tend to use a limited set of datasets, especially when they address the same task. We also found that the lack of citation standards for dataset usage makes such usage difficult to track, in particular due to (i) papers citing a dataset's paper without mentioning the dataset in particular sections, which indicates that the dataset was not used, and (ii) papers mentioning a dataset without citing its paper, which classical bibliometric tools like OpenAlex cannot detect.

%Limitation of manual selection and OpenAlex Usage
Our study is limited to a manually selected set of datasets and venues and may therefore be biased by this selection. We also did not try to disambiguate between different versions of a dataset (for example, different years of BRATS, or datasets with similar names), because identifying even these more easily identifiable datasets was already difficult. Such disambiguation could nevertheless be valuable, both to avoid overestimating the usage of a dataset and to distinguish the various tasks present across different versions of a dataset. Extending the study to more datasets, venues, and tasks would strengthen the outcome of our work. Finally, datasets can be cited directly without an associated paper, but OpenAlex only keeps track of citations to papers. This is an important limitation; a more precise matching of citations using GROBID could be a way to track citations of datasets that have no associated paper, as can be the case for Kaggle datasets.

%Not using deep learning + what could have been done
Our method relies on regex matches and their location in the paper, which makes our tools easy to apply to other data, as only some information needs to be changed. We did not use text classification methods based on deep learning, such as fine-tuning a model pre-trained on scientific data like SciBERT~\cite{beltagy2019scibert}. While this could result in better performance, it would require fine-tuning for every new set of datasets, reducing the applicability of our tools to new settings.
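
To illustrate the general idea of combining regex matches with their location, the following is a minimal sketch and not the implementation of our tools. The patterns, the section names in \texttt{USAGE\_SECTIONS}, and the helper functions are hypothetical choices made for this example only; it assumes a paper has already been split into named sections.

```python
import re

# Hypothetical dataset patterns, chosen for illustration only.
DATASET_PATTERNS = {
    "BRATS": re.compile(r"\bBRATS\b", re.IGNORECASE),
    "ACDC": re.compile(r"\bACDC\b"),
}

# Assumption: a mention in one of these sections suggests actual usage.
USAGE_SECTIONS = {"methods", "experiments", "results"}


def find_mentions(sections):
    """Map each dataset to the sorted list of sections where its pattern matches."""
    mentions = {}
    for name, pattern in DATASET_PATTERNS.items():
        hits = sorted(sec for sec, text in sections.items() if pattern.search(text))
        if hits:
            mentions[name] = hits
    return mentions


def likely_used(mentions):
    """Keep only datasets mentioned in at least one usage section."""
    return {name for name, secs in mentions.items() if USAGE_SECTIONS.intersection(secs)}


# Toy paper, already split into sections (hypothetical input format).
paper = {
    "introduction": "Benchmarks such as ACDC have driven progress in cardiac imaging.",
    "experiments": "We evaluate our model on the BraTS training set.",
}
mentions = find_mentions(paper)
used = likely_used(mentions)
```

In this sketch, adapting to a new set of datasets only requires editing the pattern dictionary and the set of usage sections, which is the property that motivated a regex-based design over a fine-tuned classifier.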

During this study, we made some anecdotal observations that we do not report on in the paper, but which we feel may warrant further study.
\begin{itemize}
\item We saw that the number of citations in a paper doubled, from 10 to 20, in 2019. This is likely because, until 2019, MICCAI counted citations towards the 8-page limit. Relaxing such page restrictions may incentivize authors not to omit dataset citations.
\item We found many instances of papers associated with GitHub repositories that promised the code would be made available upon acceptance, but where the code was never actually released.

\item We found cases where a ``backup'' of a dataset on Kaggle was cited as if it were the original source. The dataset was often stripped of its metadata and license information, and it was not clear whether the data was exactly the same as the original or a derivative of it. For a longer discussion, please see~\cite{jimenez2024towards}.

\item We discovered that ACDC is a popular name: it can refer to the Automated Cardiac Diagnosis Challenge~\cite{Bernard2018ACDC}, but also to the Adverse Conditions Dataset with images of streets~\cite{Sakaridis2021ACDC}, or to the Automatic Cancer Detection and Classification in Whole-slide Lung Histopathology challenge~\cite{li2018computeraided}.
\end{itemize}

%Final part on why better citation practices are needed (Crediting authors, having better knowledge of what our findings are based on)
We believe that better knowledge of dataset usage, and therefore easier access to dataset usage information, is needed. In addition to giving due credit to the creators of a dataset, this can raise awareness of the overuse of a particular dataset for a task, which could have a negative impact on real performance, and of the over-representation of a task with regard to real clinical needs. Working towards the adoption of a standard for indicating the usage of a dataset seems to be an essential step towards this objective. As examples, NeuroImage has a specific section on data availability at the end of each manuscript, and in 2023, MICCAI added the obligation to declare ``the data origin, data license, and (when appropriate) ethics application number for any public or private data used in the preparation of the paper''. While such requirements will not solve all the issues at hand, we believe that including a ``Data availability'' section could be an easy solution to put in place that would pave the way towards more systematic ways of determining the usage of a dataset. There are of course still many unanswered questions as to how exactly to implement this, for example what to do in cases of derivative datasets, synthetic data, and so forth, which we hope we can discuss together as a community.

%
