\subsection{Data selection}
We apply our tools to a set of 20 publicly available medical datasets, covering both classification and segmentation of various organs, listed in Appendix~\ref{appendix:datasets}. We initially tried to identify datasets systematically via Google datasets and OpenAlex. However, this yielded many poorly documented datasets (particularly on COVID-19) that did not have distinct names and for which we could not determine whether they partly duplicated other datasets.
Therefore, we selected datasets based on a combination of the authors' prior knowledge and recent surveys in medical imaging that provide a table or list of datasets \cite{hesamian2019deep,eljurdi2021high,niyas2022medical,qureshi2022medical,CALLI2021102125,guan2021domain}. In order to obtain enough data to analyse, we took the following aspects into account for the selection: presence of a paper linked to the dataset in OpenAlex, year of publication, having some citations in OpenAlex, and having a unique name or acronym to make the detection process more reliable. We chose to analyse papers from two major medical image analysis conferences, MICCAI and MIDL, because papers from these venues are more likely to cite or mention such datasets.

We identified 4835 papers in total (4569 from MICCAI and 266 from MIDL); however, 44 were discarded because we could not obtain the content of the paper or its list of references. We categorize the remaining papers into three groups, adjusting our processing slightly for each group according to the data that is missing:
\begin{itemize}
\item Group 1 ($n=2327$): papers that either have all information or only lack the abstract in OpenAlex. In the latter case, we analyze the abstract extracted from the full text of the paper.

\item Group 2 ($n=2237$): papers for which the full text is not available, but in which we can still detect dataset mentions using the OpenAlex abstract. This is an important limitation, as the abstract does not always mention the datasets used.
All the papers in this group are from MICCAI, which shows the usefulness of the modular component for obtaining full texts. For MIDL, in contrast, the structure of the PMLR website and the fully open access to the PDFs allowed us to develop a scraping tool that retrieves all MIDL full texts, even though they were not accessible from OpenAlex.

\item Group 3 ($n=227$): papers that do not have references in OpenAlex. We therefore detect citations only with our simpler matching of dataset paper titles using Grobid, which may result in less accurate detection. The majority of papers in this group are from MIDL, as information on papers from this conference is almost absent from OpenAlex. This shows that downloading and analysing the full text is a crucial part of our method. %This is also an argument in favour of open science and the free accessibility of scientific papers.
\end{itemize}
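
As a consistency check on the counts above, the three groups should account for every paper retained after the 44 discarded ones are removed. The arithmetic below only restates numbers already given in the text:
\[
4835 - 44 = 4791 = 2327 + 2237 + 227 .
\]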

\subsection{Concentration of research on few datasets}
Although we considered the number of citations in OpenAlex when making the first selection of datasets, some datasets had very low numbers of citations and mentions in MICCAI and MIDL. In \figureref{fig:cumul_counts,fig:stackbar_presence_type} we only present results for eight datasets that either have the highest usage or exemplify one of our conclusions; a more complete version including all the datasets can be found in Appendix \ref{appendix:complete_figures}. This result may reflect the focus on a few particular datasets when using publicly available data, as also shown in \cite{koch2021reduced}, especially among datasets for the same task (cardiac segmentation and chest classification) such as ACDC and M\&Ms. This is also visible in \figureref{fig:cumul_counts}, which shows a large gap between the counts of citations and mentions for BRATS and those of the other most frequently occurring datasets.


Note the difference in growth between the datasets, which might suggest a snowball effect where popular datasets become even more popular. This seems to be the case for BRATS, ACDC and Chexpert, which show very strong growth in citations and mentions. For other datasets such as LIDC-IDRI or DRIVE, the growth in citations and mentions is more gradual, and it even stagnates for DRIVE. Multiple factors can affect the popularity of a dataset. One of the most straightforward is the year of publication, but although Chexpert and PadChest were released at the same time, the latter is almost absent from our list of papers. Therefore, other aspects, such as how the dataset is updated or whether a competition has been organized around it, could explain such differences.

\begin{figure}[H]
\floatconts
    {fig:cumul_counts}
    {\caption{Cumulative counts per year of dataset citations (solid line) and mentions (dashed line) for classification datasets (a) and segmentation datasets (b).}}
    {
        \subfigure{
            \label{fig:cumul_count_classif}% label for this sub-figure
            \includegraphics[width=0.45\textwidth]{images/classification_citations_mentions.png}
        }
        \subfigure{
            \label{fig:cumul_count_segm}% label for this sub-figure
            \includegraphics[width=0.45\textwidth]{images/segmentation_citations_mentions.png}
        }
    }
\end{figure}

\subsection{Lack of citation standards leads to difficulty in tracking usage}

A citation of a dataset does not necessarily imply actual usage, and not all datasets that are used are cited in the references section. We further analyze this difference between mentions and citations in \figureref{fig:stackbar_presence_type}, in which we assign each presence of a dataset in a paper to one of the categories described in Section \ref{section:quantifying}. Although the proportions of the groups vary between subsets, in almost every subset more than 25\% of the presences are only cited and around 10\% are only mentioned.
We considered papers from the ``Only Cited'' group as not using the dataset but citing it in the introduction or related work, mostly for general statements about machine learning usage in the medical sector. However, 132 papers out of 233 lack the full text, so only the abstract is used to detect the mention. A fraction of these papers could therefore mention and use the dataset, but the lack of information prevents our tool from detecting it.
On the other hand, the ``Only Mentioned'' group mostly represents papers that use a dataset without citing the associated paper. These two groups illustrate the lack of standards for indicating the usage of a dataset in a way that can be easily tracked. This also supports our approach of analyzing part of the full text to determine such usage.
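
To make the three categories concrete, one possible formalisation is sketched below. The notation is introduced here for illustration only and is an assumption, not the definition given in Section \ref{section:quantifying}: for a dataset $d$, let $C_d$ denote the set of papers in which a citation of the dataset paper is detected and $M_d$ the set of papers in which a mention of the dataset is detected.
\[
\begin{array}{rcl}
\mathrm{OnlyCited}(d) & = & C_d \setminus M_d, \\
\mathrm{OnlyMentioned}(d) & = & M_d \setminus C_d, \\
\mathrm{CitedAndMentioned}(d) & = & C_d \cap M_d .
\end{array}
\]
Under this reading the three sets are disjoint and together cover $C_d \cup M_d$. If the 233 papers are taken to be those in the first set, the missing full texts concern $132/233 \approx 57\%$ of them, so $M_d$ may be underestimated for a substantial share of that group.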

\begin{figure}[H]
\floatconts
  {fig:stackbar_presence_type}
  {\caption{Type of presence per dataset and venue. The number in square brackets indicates the total number of papers for this subset. The ``Only Cited'' group in blue represents papers that cite a dataset without a detected mention and therefore may not use it. The ``Only Mentioned'' group in orange represents poor citation practice, as the usage would not be detected by tools that track citations. The ``Cited and Mentioned'' group in green represents the best practice.}}
  {\includegraphics[width=0.60\columnwidth,keepaspectratio]{images/stackbar_presence_type.png}}
\end{figure}


