Data Augmentation via Large Language Models and UMLS for Few-shot Named Entity Recognition in Medical Texts

Anonymous

16 Feb 2024 · ACL ARR 2024 February Blind Submission
Abstract: Few-shot learning with large language models holds substantial potential in the biomedical domain, where obtaining extensive annotated data for specialized tasks is often challenging. When only small annotated datasets are available, incorporating domain knowledge from external sources is a common strategy. In this paper, we explore knowledge augmentation strategies for biomedical named entity recognition (NER) that incorporate information encapsulated in the Unified Medical Language System (UMLS). We leverage UMLS knowledge, including its hierarchical structure, together with information from large language models (LLMs) to automatically generate new training examples in few-shot settings. We further explore the viability of employing GPT-3.5 to extract biomedical named entities from Reddit data focused on prescription and illicit opioids. The results show an average improvement of 13\% in F$_1$-score over five established NER datasets, and a 6\% increase on the Reddit-Impacts dataset after prompt-engineering refinements. Our findings indicate that using UMLS and LLMs as a joint source of prior knowledge can be a viable approach for improving the state of the art in few-shot NER for medical text.
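
The abstract describes generating new few-shot training examples from UMLS knowledge. The sketch below is not the authors' released code; it illustrates one plausible reading of that augmentation step, in which gold entity mentions in a BIO-tagged sentence are swapped for synonymous concept names. The UMLS_SYNONYMS table and the augment_sentence helper are illustrative stand-ins for a real UMLS lookup (e.g., over concept atoms).

```python
# Minimal sketch of synonym-substitution augmentation over BIO-tagged data.
# UMLS_SYNONYMS is a hard-coded stand-in for an actual UMLS synonym lookup.
from typing import Dict, List, Tuple

UMLS_SYNONYMS: Dict[str, List[str]] = {
    "myocardial infarction": ["heart attack", "MI"],
    "acetaminophen": ["paracetamol", "APAP"],
}

def augment_sentence(tokens: List[str], tags: List[str]) -> List[Tuple[List[str], List[str]]]:
    """Return copies of a BIO-tagged sentence with each entity mention
    replaced by its (stand-in) UMLS synonyms."""
    augmented = []
    i = 0
    while i < len(tokens):
        if tags[i].startswith("B-"):
            ent_type = tags[i][2:]
            j = i + 1
            while j < len(tokens) and tags[j] == f"I-{ent_type}":
                j += 1
            mention = " ".join(tokens[i:j]).lower()
            for synonym in UMLS_SYNONYMS.get(mention, []):
                syn_tokens = synonym.split()
                syn_tags = [f"B-{ent_type}"] + [f"I-{ent_type}"] * (len(syn_tokens) - 1)
                augmented.append((tokens[:i] + syn_tokens + tokens[j:],
                                  tags[:i] + syn_tags + tags[j:]))
            i = j
        else:
            i += 1
    return augmented

# Example: one few-shot sentence yields two synonym-substituted variants.
toks = ["Patient", "suffered", "a", "myocardial", "infarction", "yesterday", "."]
tgs  = ["O", "O", "O", "B-Disease", "I-Disease", "O", "O"]
for t, g in augment_sentence(toks, tgs):
    print(list(zip(t, g)))
```

In the paper's setting, an LLM could additionally rephrase the surrounding context or filter implausible substitutions; the exact pipeline is not specified in the abstract.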
Paper Type: long
Research Area: Information Extraction
Contribution Types: NLP engineering experiment, Approaches to low-resource settings, Data resources
Languages Studied: English