Keywords: Low-resource languages, prompting, information extraction, NER, Bangla, Basque
Abstract: Despite the strong multilingual abilities of modern LLMs, biomedical information extraction remains inconsistent for low-resource, morphologically rich languages such as Bangla (Bengali) and Basque. Prior investigations of prompt design and output schemas have focused on high-resource settings. To bridge this gap, we systematically evaluate biomedical named entity recognition (NER) with open LLMs under multiple prompting settings. We find that span-based extraction is much more effective than BIO tagging for LLM prompting across all languages, and that moving from statement-based to question-based prompting has a stronger effect on low-resource languages than on high-resource ones (e.g., +57% for Bangla and +109% for Basque, but only +28% for English and +22% for Spanish). Our breakdowns by error type show that translation-based prompting cuts Bangla hallucinations by 64% and that QA-style prompting lowers Basque empty-prediction errors by 61%. Our results offer practical guidance for building reliable multilingual biomedical NER systems in low-resource languages.
Paper Type: Long
Research Area: Multilinguality and Language Diversity
Research Area Keywords: Low-resource languages, prompting, information extraction, NER, Bangla, Basque
Contribution Types: NLP engineering experiment, Approaches to low-resource settings
Languages Studied: Bangla, Bengali, Basque, Spanish, English
Submission Number: 7858