Beyond the surface: Revealing researchers' behaviour in public repositories (First version)

Maria Juliana Rodriguez Cubillos, Tomasz Zieliński, T. Ian Simpson, Jason Swedlow, Andrew J. Millar

Published: 27 Nov 2024, Last Modified: 25 Jan 2026 | Zenodo | CC BY-SA 4.0
Abstract: This poster was presented at the Edinburgh Open Research Conference 2024 on 29 May at the University of Edinburgh. Ensuring the availability and accessibility of data has become necessary for advancing knowledge. Both are vital for adhering to the FAIR (Findable, Accessible, Interoperable, and Reusable) principles in scientific data management. Accurate documentation of studies, commonly known as metadata, is indispensable to achieving this goal. Regrettably, public records often fall short, providing inadequate, repetitive, and incomplete descriptions that hinder the flow of knowledge. Our project addresses the metadata challenge by analysing current repositories and designing prompts for better metadata in the future. Rapid advances in AI create an ideal environment for integrating these methods into the everyday research practice of publishing datasets. Leading repositories secure metadata quality by employing skilled data curators; however, this approach adds substantial cost to database management. We therefore aim to develop a user-friendly and cost-effective tool that uses AI to enrich metadata, specifically targeting named entities within unstructured text. Here, we present preliminary results from our assessment of the original metadata in our target repositories. This analysis, which employs standard text mining metrics, serves to identify critical features, characteristics, similarities, and differences within the dataset. In our initial steps, we analysed records sourced from BioDare2 (https://biodare2.ed.ac.uk/), a domain-specific repository for biological time series data that stores over 15,000 datasets, and DataShare (https://datashare.ed.ac.uk/), a domain-agnostic research data repository at the University of Edinburgh with more than 6,500 data entries. The BioDare2 database has no curation process apart from minimal length requirements.
In contrast, DataShare is considered semi-curated because deposits are screened for relevance to the repository's scope, for valid layout and format, and to exclude spam. The repository's curators also offer suggestions for enhancing metadata quality. The difference between the two databases is evident in our analysis of metadata metrics and reinforces the need for efficient, automatic curation. These results serve as the groundwork for scoping and developing our tools for metadata enhancement with...