NERvous About My Health: Constructing a Bengali Medical Named Entity Recognition Dataset

Alvi Aveen Khan; Fida Kamal; Nuzhat Nower; Tasnim Ahmed; Sabbir Ahmed; Tareque Mohmud Chowdhury

Published: 07 Oct 2023, Last Modified: 01 Dec 2023, EMNLP 2023 Findings

Submission Type: Regular Short Paper

Submission Track: Resources and Evaluation

Keywords: Named Entity Recognition, Natural Language Processing, Consumer Health

Abstract: Named Entity Recognition (NER), the task of identifying important entities in a text, is useful in a wide variety of downstream tasks in the biomedical domain. The task is considerably harder for Consumer Health Questions (CHQs), which are written in the informal, day-to-day language of patients. These difficulties are amplified in Bengali, which allows great flexibility in sentence structure and varies significantly across regional dialects. Unfortunately, the limited amount of available data does not reflect this complexity, which makes it difficult to build a reliable decision-making system. To address this scarcity, this paper presents 'Bangla-HealthNER', a comprehensive dataset for identifying named entities in health-related Bengali text. It consists of 31,783 samples sourced from a popular online public health platform, which allows it to capture the diverse range of linguistic styles and dialects used by native speakers from various regions in their day-to-day lives. Insight into this linguistic diversity will be useful to any medical decision-making system developed for real-world applications. To highlight the difficulty of the dataset, it has been used to benchmark state-of-the-art token classification models, among which BanglishBERT achieved the highest performance, with an F1-score of $56.13 \pm 0.75$%. The dataset and all relevant code used in this work have been made publicly available.
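
As an illustration of how a score such as "F1 of mean ± standard deviation" can be produced for a token classification model, the following is a minimal sketch of entity-level F1 over BIO-tagged sequences, averaged over several runs. It is not the authors' evaluation code: the BIO tagging scheme, the label names `Symptom` and `Medicine`, the helper names, and the use of the sample standard deviation are all assumptions made for this example, since the abstract does not specify them.

```python
from statistics import mean, stdev


def extract_spans(tags):
    """Return the set of (start, end, label) entity spans in a BIO tag sequence."""
    spans, start, label = set(), None, None
    # A trailing "O" sentinel guarantees the last open span is closed.
    for i, tag in enumerate(list(tags) + ["O"]):
        continues = tag.startswith("I-") and label == tag[2:]
        if start is not None and not continues:
            spans.add((start, i, label))
            start, label = None, None
        if tag.startswith("B-"):
            start, label = i, tag[2:]
    return spans


def entity_f1(gold_sequences, pred_sequences):
    """Micro-averaged F1 where a prediction counts only on an exact span and label match."""
    tp = n_gold = n_pred = 0
    for gold, pred in zip(gold_sequences, pred_sequences):
        gold_spans, pred_spans = extract_spans(gold), extract_spans(pred)
        tp += len(gold_spans & pred_spans)
        n_gold += len(gold_spans)
        n_pred += len(pred_spans)
    if tp == 0:
        return 0.0
    precision, recall = tp / n_pred, tp / n_gold
    return 2 * precision * recall / (precision + recall)


def summarize(scores):
    """Mean and sample standard deviation of per-run scores (needs at least two runs)."""
    return mean(scores), stdev(scores)


# Hypothetical toy data: one gold sentence and one model prediction.
gold = [["B-Symptom", "I-Symptom", "O", "B-Medicine"]]
pred = [["B-Symptom", "I-Symptom", "O", "O"]]
f1 = entity_f1(gold, pred)

# Hypothetical per-seed scores, only to show the aggregation step.
avg, spread = summarize([0.5, 0.6, 0.7])
```

Exact-match scoring of this kind is strict: a predicted entity whose boundary is off by a single token earns no credit, which is one reason informal, dialect-rich text tends to yield low scores.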

Submission Number: 2061
