Leveraging Machine-Translated Data for Sentiment Analysis in Low-Resource Languages: A Case Study on Bengali

Published: 2025, Last Modified: 06 Nov 2025. Venue: ICANN (3) 2025. License: CC BY-SA 4.0
Abstract: Sentiment analysis identifies the polarity of a text, that is, whether the sentiment it expresses is positive or negative. Bengali, the seventh most spoken language globally, remains a low-resource language, which poses challenges for sentiment analysis. This research therefore explores machine translation (MT) as a way to generate large datasets for low-resource languages. Specifically, we translate the IMDB review dataset (IMDB-EN) from English to Bengali and Hindi using Google Translator, word-by-word translation, and the “bridge-translation” method. Models trained on the translated datasets are then compared with models trained on native datasets. We extensively evaluate traditional machine learning and deep learning algorithms, including transformer-based models, alongside large language models (LLMs) such as GPT-4o and Gemini-1.5 Flash. Additionally, 173 Bengali samples from the Google-translated dataset are manually translated to analyze the model’s performance when trained on both translated and native data. A similar experiment is conducted for Hindi, where the IMDB-EN dataset is translated to Hindi (IMDB-HN) and used to train a model, which is then tested against the Hindi Amazon reviews corpus. Our findings indicate that the “bridge-translation” method positively affects the classifier’s performance. The model trained on the bridge-translated dataset achieves an accuracy of 90.23% when tested on the native Bengali dataset, demonstrating the potential of machine translation to boost sentiment analysis performance for low-resource languages. Our data and code are available (https://github.com/abirmoy/Bengali-Sentiment-Analysis-on-MT-Dataset).
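The abstract does not say how the word-by-word translation is implemented. As a rough illustration only, the sketch below assumes a plain bilingual lexicon lookup; the function name `translate_word_by_word` and the three-entry `TOY_LEXICON` are hypothetical and are not taken from the paper or its repository.

```python
# Toy English-to-Bengali lexicon. The entries are illustrative assumptions,
# not the lexicon used by the authors.
TOY_LEXICON = {
    "good": "ভালো",
    "bad": "খারাপ",
    "movie": "সিনেমা",
}


def translate_word_by_word(text, lexicon):
    """Replace each whitespace-separated token with its lexicon entry.

    Tokens missing from the lexicon are kept unchanged, and the source
    word order is preserved.
    """
    tokens = text.lower().split()
    return " ".join(lexicon.get(token, token) for token in tokens)


if __name__ == "__main__":
    print(translate_word_by_word("Good movie", TOY_LEXICON))
```

Because such a lookup keeps the English word order and leaves unknown words untranslated, it is plausibly the weakest of the three translation strategies the abstract lists, which makes it a useful baseline for comparison.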