Streamlining Knowledge Discovery in Scientific Literature: A Comprehensive End-to-End System for Research Artifact Analysis

ACL ARR 2024 June Submission2547 Authors

15 Jun 2024 (modified: 08 Aug 2024), ACL ARR 2024 June Submission, CC BY 4.0
Abstract: Knowledge Discovery and Research Artifact Analysis (RAA) are crucial for promoting reproducibility and reusability in scientific research. In this work, we introduce a novel end-to-end system that efficiently identifies and analyzes tangible research artifacts (RAs), specifically datasets and software, within scientific literature. Building on recent advancements, our architecture employs Large Language Models (LLMs) fine-tuned with the Low-Rank Adaptation (LoRA) method to reformulate RAA as an instruction-based Question Answering (QA) task. The system comprises five stages: (i) candidate detection using a list of curated keywords and gazetteers, (ii) RA mention identification and validation, (iii) extraction of RA mention metadata, such as names, versions, licenses, and URLs, (iv) classification of RA mentions by usage and provenance, and (v) deduplication of RA mentions to ensure the uniqueness of each identified RA. Through benchmarking on two RA mention datasets, we demonstrate robust performance in RAA and provide a comprehensive qualitative analysis, underscoring the nuances and complexities of ensuring reproducibility and reusability in diverse scientific fields.
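The five stages listed in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the keyword list, the `Artifact` fields, and the `answer` function are all hypothetical, and `answer` uses toy rules in place of the LoRA fine-tuned LLM so that the example runs without a model. It only shows how one question-answering interface could serve validation, metadata extraction, and classification before deduplication.

```python
import re
from dataclasses import dataclass

# Hypothetical keyword list; the paper's curated gazetteers are not given here.
KEYWORDS = ("dataset", "corpus", "software", "toolkit", "library")
GENERIC = {"this", "that", "the", "a", "our", "each"}  # words that are not RA names


@dataclass
class Artifact:
    name: str
    url: str = ""
    used: bool = False     # usage: the paper uses the artifact
    created: bool = False  # provenance: the paper introduces the artifact


def detect_candidates(sentences):
    """Stage (i): keep sentences that contain a gazetteer keyword."""
    return [s for s in sentences if any(k in s.lower() for k in KEYWORDS)]


def answer(question, sentence):
    """Stand-in for the instruction-based QA call to a fine-tuned LLM.

    Toy rules replace the model so the sketch runs offline.
    """
    words = set(re.findall(r"[a-z]+", sentence.lower()))
    if question == "name":
        m = re.search(r"(\w+) (?:%s)\b" % "|".join(KEYWORDS), sentence)
        return m.group(1) if m else ""
    if question == "url":
        m = re.search(r"https?://[^\s)]+", sentence)
        return m.group(0) if m else ""
    if question == "used":
        return bool(words & {"use", "used", "train", "evaluate"})
    if question == "created":
        return bool(words & {"release", "introduce", "present"})
    raise ValueError(f"unknown question: {question}")


def run_pipeline(sentences):
    """Run stages (i) to (v) and return one Artifact per unique name."""
    merged = {}
    for s in detect_candidates(sentences):                 # (i) candidates
        name = answer("name", s)
        if not name or name.lower() in GENERIC:             # (ii) validation
            continue
        art = Artifact(name, answer("url", s),              # (iii) metadata
                       answer("used", s), answer("created", s))  # (iv) classes
        key = name.lower()                                  # (v) deduplication
        if key in merged:
            first = merged[key]
            first.url = first.url or art.url
            first.used = first.used or art.used
            first.created = first.created or art.created
        else:
            merged[key] = art
    return list(merged.values())
```

In this sketch, two sentences that mention "SQuAD" and "squad" collapse into a single artifact whose usage flag and URL are merged from both mentions. Keying on the lowercased name is a deliberately crude choice for illustration; the abstract does not say how deduplication is performed.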
Paper Type: Long
Research Area: Information Extraction
Research Area Keywords: open information extraction, knowledge base construction, entity linking/disambiguation, document-level extraction
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Approaches to low-resource settings, Publicly available software and/or pre-trained models
Languages Studied: English
Submission Number: 2547