Abstract: Research in Bengali Natural Language Processing (BNLP) is rapidly expanding. Yet although Bengali is one of the most widely spoken languages in the world, BNLP research remains insufficient, particularly in speech recognition. The language’s rich morphology, agglutinative structure, and diverse dialects make text and speech processing especially challenging. However, these challenges can be addressed with effective preprocessing techniques. Various organizations in Bangladesh and West Bengal are integrating Natural Language Processing (NLP) into their services, but without a thorough understanding of preprocessing, these implementations remain incomplete. Proper preprocessing of Bengali text and speech provides a foundation for developing robust NLP applications. This paper presents a comprehensive review of preprocessing techniques in BNLP based on state-of-the-art research. It covers key areas such as sentiment analysis, named entity recognition, speech recognition, text categorization, and summarization. First, the paper discusses in depth the characteristics of the Bengali language and the research areas in BNLP. It then explores the challenges researchers face in processing Bengali text and speech. It also details various preprocessing techniques, highlighting their advantages and disadvantages. Finally, the paper examines future directions for BNLP, emphasizing the role of effective preprocessing in advancing the field.
External IDs: doi:10.1109/access.2025.3574234