Keywords: Synthetic Data, Biomedical NER
TL;DR: Improving low-resource Biomedical NER using synthetic data
Abstract: Named Entity Recognition (NER) is a crucial task in Natural Language Processing (NLP); however, achieving high NER performance in the biomedical domain remains a challenge because annotated data is scarce. To tackle low-resource biomedical NER, we propose a novel approach, BioSynNER, which generates synthetic training data with large language models (LLMs). BioSynNER first mines key domain-specific attributes from seed sentences and then uses these attributes to generate synthetic examples. Interestingly, we find that paraphrasing the seed sentences is more effective than generating data from scratch, as it preserves contextual and structural nuances that improve biomedical NER performance. Additionally, BioSynNER integrates the Unified Medical Language System (UMLS), a comprehensive yet noisy medical knowledge base, to address the complexity and diversity of biomedical entity types. This combined approach not only improves NER accuracy on biomedical texts but also provides a scalable framework for synthetic data generation that is applicable to other specialized domains. Experimental results confirm the effectiveness of BioSynNER for low-resource biomedical NER.
Submission Number: 46