Beyond the surface: Revealing researchers' behaviour in public repositories
Abstract: This poster was presented at the PhD Connect Conference 2024, organised by The Alan Turing Institute on 21 and 22 November 2024 in Leeds, UK. Ensuring the availability and accessibility of research data has become necessary to advance knowledge. Accurate documentation of studies, commonly known as metadata, is indispensable to achieving this goal and vital for adhering to the FAIR (Findable, Accessible, Interoperable, and Reusable) principles of scientific data management. Rapid advances in Artificial Intelligence (AI) create an ideal environment for integrating AI-driven methods into the everyday research practice of publishing datasets. Leading repositories secure metadata quality by employing skilled data curators, albeit at substantial expense. This project aims to design a metadata enrichment tool that gives users feedback on the information they enter as free-text descriptions while submitting new datasets. Text mining, Machine Learning (ML), and Natural Language Processing (NLP) models, including advanced Large Language Models (LLMs) such as GPT and Llama 2, will be used for this purpose. An exploration of 15,424 metadata reports from BioDare2 shows that optional fields for additional relevant information, such as “Description” and “Comment”, are empty in 62% and 92% of the records, respectively. Nevertheless, 11,526 entities associated with 38 species were successfully identified in the dataset. This opens the door to further use of the dataset in developing the metadata enrichment tool.
External IDs: doi:10.5281/zenodo.14228703