SentiGOLD: A Large Bangla Gold Standard Multi-Domain Sentiment Analysis Dataset and Its Evaluation

Md. Ekramul Islam, Labib Chowdhury, Faisal Ahamed Khan, Shazzad Hossain, Md. Sourave Hossain, Mohammad Mamun Or Rashid, Nabeel Mohammed, Mohammad Ruhul Amin

Published: 2023, Last Modified: 13 Nov 2023, KDD 2023, Readers: Everyone

Abstract: In this study, we present SentiGOLD, a Bangla multi-domain sentiment analysis dataset of 70,000 samples compiled from a variety of sources and annotated by a gender-balanced team of linguists. The dataset was created in accordance with a standard set of linguistic conventions established after multiple meetings between the Government of Bangladesh and a nationally recognized Bangla linguistics committee. Although standard sentiment analysis datasets are available for English and other resource-rich languages, no such dataset exists for Bangla, in particular because there was no standard linguistic framework agreed upon by national stakeholders. SentiGOLD derives its raw data from online video comments, social media posts and comments, blog posts and comments, news, and numerous other sources. Throughout the development of the dataset, domain distribution and class distribution were rigorously maintained. SentiGOLD was created using data from a total of 30 domains (e.g., politics, entertainment, sports) and was labeled using 5 classes (strongly negative, weakly negative, neutral, weakly positive, and strongly positive). To maintain annotation quality, the national linguistics committee approved an annotation scheme to ensure a rigorous Inter Annotator Agreement (IAA) in a multi-annotator annotation scenario. This procedure yielded an IAA score of 0.88 using Fleiss' kappa, which is elaborated upon in the paper. We used a protocol for intra- and cross-dataset evaluation in our effort to establish a standard classification system. The cross-dataset evaluation was performed on the SentNoB dataset, which contains noisy Bangla text samples, thereby establishing a demanding test scenario. We also performed cross-dataset testing through zero-shot experiments, and our best model produced competitive performance, which demonstrates our dataset's generalizability. Our top model attained a macro F1 of 0.62 (intra-dataset) for 5 classes, establishing the benchmark for SentiGOLD, and 0.61 (cross-dataset from SentNoB) for 3 classes, which is comparable to the current state of the art. Our fine-tuned sentiment analysis model can be accessed online at https://sentiment.bangla.gov.bd.
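
For readers unfamiliar with the two metrics reported in the abstract, the following is a minimal Python sketch of how Fleiss' kappa (used for the IAA score) and macro F1 (used for model evaluation) are commonly computed. It is not the authors' evaluation code: the function names `fleiss_kappa` and `macro_f1` and the toy labels are illustrative assumptions, and the kappa function assumes every sample is labeled by the same number of annotators.

```python
from collections import Counter

# The five SentiGOLD classes named in the abstract.
CLASSES = [
    "strongly negative",
    "weakly negative",
    "neutral",
    "weakly positive",
    "strongly positive",
]


def fleiss_kappa(ratings, categories):
    """Fleiss' kappa for a list of items, each a list of annotator labels.

    Assumption: every item has the same number of annotators (at least 2).
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(item) != n_raters for item in ratings):
        raise ValueError("each item needs the same number (>= 2) of ratings")

    category_totals = Counter()
    agreement_sum = 0.0
    for item in ratings:
        counts = Counter(item)
        category_totals.update(counts)
        # Share of annotator pairs that agree on this item.
        agreeing = sum(c * c for c in counts.values()) - n_raters
        agreement_sum += agreeing / (n_raters * (n_raters - 1))

    observed = agreement_sum / n_items
    total = n_items * n_raters
    # Agreement expected by chance from the overall label proportions.
    expected = sum((category_totals[c] / total) ** 2 for c in categories)
    if expected == 1.0:
        return 1.0  # degenerate case: only one category was ever used
    return (observed - expected) / (1.0 - expected)


def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1 scores."""
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # A class with no true or predicted samples contributes 0.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Toy annotations (made up for illustration): 3 annotators, 4 samples.
    toy_ratings = [
        ["neutral", "neutral", "neutral"],
        ["weakly positive", "weakly positive", "strongly positive"],
        ["strongly negative", "strongly negative", "strongly negative"],
        ["weakly negative", "neutral", "weakly negative"],
    ]
    print("Fleiss' kappa:", round(fleiss_kappa(toy_ratings, CLASSES), 3))

    # Toy predictions (made up for illustration).
    gold = ["neutral", "weakly positive", "strongly negative", "neutral"]
    pred = ["neutral", "strongly positive", "strongly negative", "neutral"]
    print("Macro F1:", round(macro_f1(gold, pred, CLASSES), 3))
```

Because macro F1 averages the per-class scores without weighting by class frequency, rare classes count as much as frequent ones, which is one reason it is a common choice for multi-class sentiment benchmarks.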
