DataRef: A Dataset of Data Citations
Abstract: The data in this record were generated while developing the Data Gatherer, an LLM-powered tool that automates the identification of dataset mentions and the extraction of structured dataset records from publications. All publications and dataset repository records used for development and testing are open access. This record includes two datasets:

DataRef-EXP Dataset (file name: EXP_groundtruth.csv): The DataRef-EXP dataset was created by manually selecting and reviewing scholarly journal articles to ensure a diverse representation of dataset citation formats. All articles were sourced exclusively from PubMed Central (PMC). A total of 21 journal articles were chosen, yielding 48 dataset references. Articles were chosen to maximize the variation in how datasets were referenced, enabling a comprehensive evaluation of the Data Gatherer tool’s ability to extract dataset mentions across various formats. Additionally, some articles were chosen because their dataset mentions contain errors, such as inaccurate accession numbers or incomplete dataset information (e.g., an accession number but no named repository).

DataRef-REV Dataset: Full Dataset (file name: Full_REV_dataset_citation_records_Table.parquet): This dataset was constructed using a reverse engineering methodology, leveraging structured metadata from ProteomeCentral and Gene Expression Omnibus (GEO). ProteomeCentral is a valuable source of ground truth data, offering curated metadata for 23,348 publicly available datasets, including dataset identifiers and one or more valid related paper DOIs or PubMed identifiers. Similarly, GEO is a public functional genomics data repository managed by the National Center for Biotechnology Information (NCBI). GEO provides programmatic access through a REST API that we used to retrieve 165,078 dataset identifiers with valid references to publications that mention them.
A key limitation of this dataset is that it covers only datasets deposited in repositories that are part of the ProteomeXchange consortium or available in GEO. Each dataset entry includes a unique identifier, typically an accession code, along with the corresponding repository name, such as PRIDE, MassIVE, jPOST, iProX, PeptideAtlas, or PanoramaPublic.

Sample Dataset (file name: REV_sample_groundtruth.csv): Sampled from the Full Dataset described above. It is a balanced sample with respect to the two metadata sources (GEO and ProteomeXchange). In summary, 1242 rows, where each row is a PMC...
External IDs: doi:10.5281/zenodo.15549085
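The abstract describes the sample file as balanced across its two metadata sources. A minimal sketch of how a reader might check such a claim on a CSV file is given below, using only the Python standard library. The column name `source`, the helper names `count_rows_by_column` and `is_balanced`, and the toy rows are all hypothetical: this record does not document the actual column layout of the files, so the header should be inspected first and the column name adjusted.

```python
import csv
import io
from collections import Counter
from typing import TextIO


def count_rows_by_column(handle: TextIO, column: str) -> Counter:
    """Count CSV rows per distinct value of `column`.

    `column` is whatever header names the metadata source; the real
    header used in the DataRef files is not stated in this record.
    """
    reader = csv.DictReader(handle)
    if reader.fieldnames is None or column not in reader.fieldnames:
        raise KeyError(f"column {column!r} not in header: {reader.fieldnames}")
    return Counter(row[column] for row in reader)


def is_balanced(counts: Counter, tolerance: int = 0) -> bool:
    """Return True when group sizes differ by at most `tolerance` rows."""
    if not counts:
        return True
    return max(counts.values()) - min(counts.values()) <= tolerance


# Toy, made-up rows standing in for a real file (NOT data from this record).
toy_csv = io.StringIO(
    "identifier,source\n"
    "A1,GEO\n"
    "B2,ProteomeXchange\n"
    "C3,GEO\n"
    "D4,ProteomeXchange\n"
)
toy_counts = count_rows_by_column(toy_csv, "source")
toy_balanced = is_balanced(toy_counts)
```

To apply the same check to a downloaded file, one could open it with `open("REV_sample_groundtruth.csv", newline="", encoding="utf-8")` and pass the handle to `count_rows_by_column` together with the real column name. The file encoding is likewise an assumption.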