Context-Aware Data Cleaning: Optimizing Bengali Text for Contextual Text Classification

Moshiur Rahman Faisal, Abdur Rahman Fahad, Shahriyar Zaman Ridoy, Jannat Sultana, Zinnat Fowzia Ria, Md. Hasibur Rahman, Mohammed Arif Uddin, Rashedur M. Rahman

Published: 2025, Last Modified: 09 Apr 2026. SN Comput. Sci. 2025. License: CC BY-SA 4.0

Abstract: In Natural Language Processing (NLP), textual data is foundational, yet it presents substantial challenges, especially for under-resourced languages like Bengali. The complexity and volume of Bengali textual data require sophisticated data cleaning techniques. Traditional methods often discard contextual information that is essential for effective textual analysis. This study highlights the need for context-aware data cleaning, a methodology that preserves linguistic context while removing noise. The study compares context-aware and traditional data cleaning approaches tailored for Bengali text, with the aim of improving the performance of contextual transformer-based models. The conventional techniques considered are symbol and punctuation removal, stop-word elimination, stemming, and removal of HTML tags and URLs. In contrast, the context-aware techniques are spelling correction, tagging of HTML and URLs, preservation of punctuation and emojis, and selective removal of less important words using TF-IDF. The study assesses the impact of these strategies through rigorous dataset curation and extensive training of machine learning, deep learning, and transformer-based models on four prominent Bengali datasets: BEmoC, SentNoB, UBMEC, and EmoNoBa. Results show that context-aware data cleaning significantly outperforms traditional methods, particularly for transformer-based models. The developed context-aware data cleaning pipeline integrates these techniques and improves on baseline accuracy by up to 4% on three of the four datasets. These findings underscore the importance of preserving sentence-level context in Bengali while minimizing noise. Additionally, the research introduces a novel context-aware data cleaning pipeline and provides detailed algorithms for its implementation, advancing NLP research and applications in Bengali and linguistically similar languages.
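To make the context-aware steps named in the abstract concrete, the following is a minimal illustrative sketch, not the authors' pipeline. It tags HTML and URLs with placeholder tokens rather than deleting them, leaves punctuation and emojis untouched, and drops low-scoring words using a plain TF-IDF score. The placeholder strings `[HTML]` and `[URL]`, the whitespace tokenizer, the `min_score` threshold, and all function names are assumptions made for illustration; spelling correction is omitted because it would need a Bengali lexicon.

```python
import math
import re
from collections import Counter

# Hypothetical placeholder tokens; the paper does not specify their form.
HTML_TOKEN = "[HTML]"
URL_TOKEN = "[URL]"
PROTECTED = {HTML_TOKEN, URL_TOKEN}

HTML_RE = re.compile(r"<[^>]+>")
URL_RE = re.compile(r"https?://\S+|www\.\S+")


def tag_markup(text):
    """Replace HTML tags and URLs with placeholders instead of deleting them."""
    text = HTML_RE.sub(f" {HTML_TOKEN} ", text)  # tags first, so URLs inside tags vanish with the tag
    text = URL_RE.sub(f" {URL_TOKEN} ", text)
    return text


def tokenize(text):
    """Whitespace tokenizer (an assumption): punctuation and emojis are kept as-is."""
    return text.split()


def context_aware_clean(docs, min_score=0.0):
    """Tag markup, then drop tokens whose TF-IDF score is <= min_score."""
    tokenized = [tokenize(tag_markup(doc)) for doc in docs]
    n_docs = len(tokenized)

    # Document frequency: in how many documents each token appears.
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    cleaned = []
    for tokens in tokenized:
        counts = Counter(tokens)
        kept = []
        for tok in tokens:
            tf = counts[tok] / len(tokens)
            idf = math.log(n_docs / doc_freq[tok])  # 0.0 for tokens found in every document
            if tok in PROTECTED or tf * idf > min_score:
                kept.append(tok)
        cleaned.append(" ".join(kept))
    return cleaned
```

With the default threshold, only tokens that occur in every document score zero and are removed; a real pipeline would tune the threshold and the IDF smoothing on held-out data.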

External IDs: dblp:journals/sncs/FaisalFRSRRUR25