We propose tools to evaluate the presence of datasets using the following definitions. A dataset is \textbf{cited} if its paper is present in the reference section. It is \textbf{mentioned} if its name, aliases or URL appear in the abstract, in a section of the paper associated with the method or results (i.e., not in related work or discussion sections), in a table or figure caption, or in a footnote; we take such a mention as showing actual \textbf{usage}.
We show our pipeline in \figureref{fig:flowchart}. It has four main components: user input specifying the venues and datasets to track; the open citation index OpenAlex~\cite{priem2022openalex}; the full texts of the papers; and GROBID, a tool to extract information from scholarly documents. We used OpenAlex because it has an official, freely accessible API that aggregates and standardizes information from multiple sources such as arXiv, Crossref and PubMed, and because we aimed to create a generalizable process that could be complemented with other tools. To evaluate completeness, we compared OpenAlex with OpenCitations on the citations returned for a set of cardiac segmentation datasets. We found that 97\% of the citations returned by OpenCitations were also returned by OpenAlex, while only 84\% of the citations returned by OpenAlex were also returned by OpenCitations; thus we chose OpenAlex for our pipeline.
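
The two percentages above are directional overlap ratios between the citation sets returned by the two indexes. As a minimal sketch of that arithmetic, assuming the citing papers from both indexes have already been mapped to a shared identifier (the identifiers and the \texttt{coverage} helper below are hypothetical and are not the data or code behind the reported figures):

```python
def coverage(source: set, other: set) -> float:
    """Fraction of the citations in `source` that are also present in `other`."""
    if not source:
        raise ValueError("source must contain at least one citation")
    return len(source & other) / len(source)


# Hypothetical identifiers of citing papers returned by each index.
openalex_citations = {"W1", "W2", "W3", "W4", "W5"}
opencitations_citations = {"W1", "W2", "W3", "W6"}

# Share of OpenCitations results also found by OpenAlex, and the reverse.
oc_found_in_oa = coverage(opencitations_citations, openalex_citations)
oa_found_in_oc = coverage(openalex_citations, opencitations_citations)
print(f"{oc_found_in_oa:.0%} / {oa_found_in_oc:.0%}")
```

The two ratios generally differ because each is normalized by the size of a different set.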

\begin{figure}[H]
\floatconts
  {fig:flowchart}
  {\caption{Pipeline to detect dataset presence and usage. Green CSV files represent user input; blue CSV files represent extracted data.}}
  {\includegraphics[width=0.60\columnwidth,keepaspectratio]{images/flowchart_logo.png}}
\end{figure}

First, we ask the user to specify the list of venues and datasets to track. For each dataset, this includes its name, any aliases referring to it, and the titles and DOIs of the papers associated with it.

We use the venue list to fetch the list of papers from these venues using the DBLP API and store the papers' titles and DOIs. We then use each paper's DOI (or its title if the DOI is not available) to query the OpenAlex API for the following: (i) the list of referenced papers, (ii) the list of words in the abstract, and (iii) an open access link to the paper's full text. An example of data from OpenAlex is shown in Appendix~\ref{appendix:OA_exemples}. We then try to fetch the paper's full text. If this step fails, we complement it with a custom scraping tool. This step can be easily replaced for different venues, as long as the paper PDFs are stored in the same folder. Some data cleanup may be necessary at this stage to remove duplicates created by combining the PDFs retrieved through OpenAlex with those retrieved by the scraping tool. We then convert each PDF to an XML file using GROBID, a library that applies a cascade of machine learning models: the first models segment the document into structures such as header, full text or references, and specific models tuned for each type of structure then extract the content. This allows us to detect the different paper sections while keeping information about figures, tables and footnotes, which was lacking in alternative tools such as PyPDF. An example of the data obtained with GROBID is shown in Appendix~\ref{appendix:GROBID_exemples}.
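
The three items retrieved from OpenAlex can be read from a single work record. The sketch below assumes the record has already been downloaded and parsed into a Python dictionary, and that it exposes the \texttt{referenced\_works}, \texttt{abstract\_inverted\_index} and \texttt{open\_access.oa\_url} fields of the OpenAlex work object. OpenAlex stores abstracts as a mapping from each word to its positions, so the text has to be rebuilt. The helper names and the example record are hypothetical illustrations, not our pipeline code.

```python
from __future__ import annotations

from typing import Any, Optional


def rebuild_abstract(inverted_index: Optional[dict[str, list[int]]]) -> str:
    """Turn a {word: [positions]} mapping back into running text."""
    if not inverted_index:
        return ""
    positioned = [
        (position, word)
        for word, positions in inverted_index.items()
        for position in positions
    ]
    return " ".join(word for _, word in sorted(positioned))


def extract_fields(work: dict[str, Any]) -> dict[str, Any]:
    """Keep only the three pieces of information used downstream."""
    open_access = work.get("open_access") or {}
    return {
        "referenced_works": work.get("referenced_works") or [],
        "abstract": rebuild_abstract(work.get("abstract_inverted_index")),
        "oa_url": open_access.get("oa_url"),
    }


# Hypothetical, heavily truncated work record used only for illustration.
example_work = {
    "referenced_works": ["https://openalex.org/W0000000001"],
    "abstract_inverted_index": {
        "We": [0], "use": [1], "the": [2, 5],
        "dataset": [3], "and": [4], "code": [6],
    },
    "open_access": {"oa_url": "https://example.org/paper.pdf"},
}

fields = extract_fields(example_work)
print(fields["abstract"])
```

Missing fields fall back to an empty list, an empty string or \texttt{None}, so a record without an abstract or an open access link does not interrupt processing.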

We use the dataset list to detect citations and mentions of each dataset. We detect citations in two ways: based on the DOI of the dataset's paper converted to an OpenAlex ID, and based on matching the titles of the dataset's papers against the reference section of the full text. We detect dataset mentions by searching for the dataset's name, aliases or URL in the fields of the full text listed in our definition of a mention (abstract, method and results sections, table and figure captions, and footnotes). Examples of different types of citations and mentions can be seen in Appendix~\ref{appendix:dataset_presence}.
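
The two detection steps can be sketched as follows. This is a simplified illustration under stated assumptions: titles are compared after lowercasing and removing punctuation, and names and aliases are matched case-insensitively as whole tokens. The exact matching rules, the dictionary layout and the \texttt{ToyHeart} dataset are hypothetical and chosen only to keep the example short.

```python
import re


def normalize(text: str) -> str:
    """Lowercase and replace each run of non-alphanumeric characters by one space."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def is_cited(dataset: dict, referenced_ids: list, reference_titles: list) -> bool:
    """True if a dataset paper is found by OpenAlex ID or by normalized title."""
    if set(dataset["openalex_ids"]) & set(referenced_ids):
        return True
    wanted = {normalize(title) for title in dataset["paper_titles"]}
    return any(normalize(title) in wanted for title in reference_titles)


def is_mentioned(dataset: dict, fields: dict) -> bool:
    """True if the name, an alias or the URL occurs in any of the given fields."""
    terms = [dataset["name"], *dataset["aliases"]]
    # Require non-alphanumeric neighbours so "ToyHeart" does not match "ToyHearts".
    patterns = [
        re.compile(rf"(?<![A-Za-z0-9]){re.escape(term)}(?![A-Za-z0-9])", re.IGNORECASE)
        for term in terms
    ]
    url = dataset.get("url", "").lower()
    for text in fields.values():
        if any(pattern.search(text) for pattern in patterns):
            return True
        if url and url in text.lower():
            return True
    return False


# Hypothetical dataset entry and paper fields, for illustration only.
toy_dataset = {
    "name": "ToyHeart",
    "aliases": ["TH-2020"],
    "url": "https://example.org/toyheart",
    "paper_titles": ["ToyHeart: a toy cardiac benchmark"],
    "openalex_ids": ["https://openalex.org/W0000000001"],
}
paper_fields = {
    "abstract": "We propose a new model.",
    "methods": "We train on TH-2020 images.",
    "footnotes": "",
}

cited = is_cited(toy_dataset, [], ["ToyHeart: A Toy Cardiac Benchmark."])
mentioned = is_mentioned(toy_dataset, paper_fields)
```

Passing only the eligible fields (and not, for example, the related work section) is what restricts a mention to the parts of the paper that indicate usage.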

Finally, we assign each dataset presence to one of the following types: only cited, only mentioned, or both cited and mentioned.
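
This assignment reduces to a small mapping from the two detection outcomes to a label. A minimal sketch, assuming the citation and mention results are available as booleans (the function name, the label strings and the example flags are hypothetical):

```python
from collections import Counter
from typing import Optional


def presence_type(cited: bool, mentioned: bool) -> Optional[str]:
    """Map the two detection results to a presence type, or None if absent."""
    if cited and mentioned:
        return "cited and mentioned"
    if cited:
        return "only cited"
    if mentioned:
        return "only mentioned"
    return None  # the dataset is not present in this paper


# Hypothetical (cited, mentioned) flags for one dataset across four papers.
flags = [(True, True), (True, False), (False, True), (False, False)]
counts = Counter(presence_type(cited, mentioned) for cited, mentioned in flags)
```

Papers in which the dataset is neither cited nor mentioned receive no presence type and are counted under \texttt{None} here.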
