Empowering Knowledge Discovery from Scientific Literature: A novel approach to Research Artifact Analysis

Published: 27 Oct 2023, Last Modified: 27 Oct 2023, NLP-OSS 2023
Keywords: Dataset Extraction, Software Extraction, Research Artifact Analysis, Knowledge Discovery, Low-Rank Adaptation, Large Language Models, Information Extraction, Natural Language Processing
Abstract: Knowledge extraction from scientific literature is a major challenge and is crucial to promoting transparency, reproducibility, and innovation in the research community. In this work, we present a novel approach to the identification, extraction, and analysis of dataset and code/software mentions within scientific literature. We introduce a comprehensive dataset, synthetically generated by ChatGPT and then meticulously curated, augmented, and expanded with real snippets of scientific text from full-text publications in Computer Science using a human-in-the-loop process. The dataset contains snippets highlighting mentions of the two research artifact (RA) types, dataset and code/software, along with metadata including their Name, Version, License, and URL, as well as their intended Usage and Provenance. We also cast Research Artifact Analysis (RAA) as an instruction-based Question Answering (QA) task and fine-tune a simple Large Language Model (LLM) on it using Low-Rank Adaptation (LoRA). Finally, we report performance improvements on the test set of our dataset compared with other base LLMs. Our method provides a significant step towards accurate, effective, and efficient extraction of datasets and software from scientific papers, helping to address the challenges of reproducibility and reusability in scientific research.
Submission Number: 9
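To make the instruction-based QA framing in the abstract more concrete, the following is a minimal sketch of how one annotated snippet might be expanded into one question per metadata field. The six field names come from the abstract; everything else is an assumption made for illustration and is not taken from the paper: the prompt template, the `N/A` placeholder for unmentioned fields, and the function names `build_prompt` and `make_examples`.

```python
from typing import Dict, List

# Metadata fields listed in the abstract.
FIELDS = ("Name", "Version", "License", "URL", "Usage", "Provenance")

# Assumed placeholder for fields the snippet does not mention (hypothetical).
MISSING = "N/A"


def build_prompt(snippet: str, artifact_type: str, field: str) -> str:
    """Format one instruction-style question about a single metadata field.

    The template below is a hypothetical layout, not the authors' prompt.
    """
    question = (
        f"What is the {field} of the {artifact_type} mentioned in the snippet?"
    )
    return (
        f"### Snippet:\n{snippet}\n\n"
        f"### Question:\n{question}\n\n"
        f"### Answer:\n"
    )


def make_examples(
    snippet: str, artifact_type: str, annotations: Dict[str, str]
) -> List[Dict[str, str]]:
    """Expand one annotated snippet into one prompt/completion pair per field."""
    return [
        {
            "prompt": build_prompt(snippet, artifact_type, field),
            "completion": annotations.get(field, MISSING),
        }
        for field in FIELDS
    ]
```

Calling `make_examples` with a snippet, an artifact type such as `"dataset"`, and a dictionary of the annotated fields yields six prompt/completion pairs, which could then serve as supervised examples for a LoRA fine-tuning run. Asking about one field per question keeps each answer short and easy to score, at the cost of six model calls per mention.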