BioSynNER: Synthetic Data for Biomedical Named Entity Recognition

Chufan Gao; Sanjit Singh Batra; Alexander Russell Pelletier; Gregory D Lyng; Zhichao Yang; Eran Halperin; Robert E. Tillman

Published: 07 Mar 2025, Last Modified: 25 Mar 2025

Venue: GenAI4Health Oral

License: CC BY 4.0

Keywords: Synthetic Data, Biomedical NER

TL;DR: Improving low-resource biomedical NER using synthetic data

Abstract: Named Entity Recognition (NER) is a crucial task in Natural Language Processing (NLP); however, achieving high NER performance in the biomedical domain remains a challenge due to the limited availability of annotated data. To tackle low-resource biomedical NER, we propose a novel approach, BioSynNER, which utilizes synthetic data generation through large language models (LLMs). BioSynNER begins by mining key domain-specific attributes from seed sentences, which are then used to generate highly effective synthetic examples. Interestingly, we find that paraphrasing these seed sentences is more effective than generating data from scratch, as it preserves contextual and structural nuances that enhance biomedical NER performance. Additionally, BioSynNER integrates the Unified Medical Language System (UMLS), a comprehensive yet noisy medical knowledge base, to address the complexity and diversity of biomedical entity types. This combined approach not only improves NER accuracy in biomedical texts but also provides a scalable framework for synthetic data generation applicable to other specialized domains. Experimental results confirm the effectiveness of BioSynNER, highlighting its potential to significantly advance NER tasks.

Submission Number: 46
