Multi-Source Bangla Violence Text Dataset and Transformer-Based Stacking Ensemble for Social Media Content Moderation

ACL ARR 2026 January Submission6317 Authors

05 Jan 2026 (modified: 20 Mar 2026), ACL ARR 2026 January Submission, CC BY 4.0

Keywords: Violence Detection, BX Stacking Ensemble, BanglaBERT, XLM-RoBERTa, Bangla, Text Classification.

Abstract: The rapid growth of user-generated content on social media platforms such as Facebook, YouTube, and TikTok has been accompanied by a rise in violence-inciting language, including in low-resource languages like Bangla. This increases the need for automated systems that can detect and filter harmful content. In this study, we propose the BX Stacking Ensemble Model, which combines two transformer-based models, BanglaBERT and XLM-RoBERTa, to improve the detection of violence-related text in Bangla. The model is trained on a newly compiled dataset of 11,933 samples that merges the Vio-Lens dataset with additional instances collected from YouTube, Facebook, and TikTok. Each sample is annotated with one of three categories: Non-Violence, Passive Violence, or Active Violence. We compare the BX Stacking Ensemble Model with traditional machine learning models and other transformer-based models; the ensemble outperforms these baselines, achieving a Macro F1 score of 0.85. The results indicate that combining a language-specific transformer with a multilingual one helps detect nuanced violence-inciting content. This research contributes to more robust and scalable content moderation for low-resource languages such as Bangla and illustrates the potential of ensemble learning for complex, real-world text classification tasks.
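As a rough illustration of the stacking idea and the Macro F1 metric mentioned in the abstract, the sketch below concatenates per-class probabilities from two base classifiers and trains a small meta-learner on them. The abstract does not say which meta-learner the BX Stacking Ensemble uses, so the softmax-regression meta-learner, the label order, all function names, and the toy probabilities (standing in for BanglaBERT and XLM-RoBERTa outputs) are assumptions made for illustration only, not the authors' implementation or data.

```python
import math

# Label order is an assumption made for this sketch.
LABELS = ["Non-Violence", "Passive Violence", "Active Violence"]

def stack_features(probs_a, probs_b):
    """Concatenate the class probabilities of two base models per sample."""
    return [list(pa) + list(pb) for pa, pb in zip(probs_a, probs_b)]

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def class_scores(W, b, x):
    """Linear score for each class given one meta-feature vector."""
    return [sum(w * v for w, v in zip(Wc, x)) + bc for Wc, bc in zip(W, b)]

def train_meta(X, y, n_classes=3, lr=0.5, epochs=200):
    """Hypothetical meta-learner: softmax regression trained with plain SGD."""
    dim = len(X[0])
    W = [[0.0] * dim for _ in range(n_classes)]
    b = [0.0] * n_classes
    for _ in range(epochs):
        for x, target in zip(X, y):
            p = softmax(class_scores(W, b, x))
            for c in range(n_classes):
                grad = p[c] - (1.0 if c == target else 0.0)
                for j in range(dim):
                    W[c][j] -= lr * grad * x[j]
                b[c] -= lr * grad
    return W, b

def predict_meta(W, b, X):
    """Predict the highest-scoring class index for each sample."""
    preds = []
    for x in X:
        scores = class_scores(W, b, x)
        preds.append(max(range(len(scores)), key=scores.__getitem__))
    return preds

def macro_f1(y_true, y_pred, n_classes=3):
    """Unweighted mean of the per-class F1 scores."""
    f1_scores = []
    for c in range(n_classes):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / n_classes

# Made-up probabilities standing in for the two base models' outputs.
probs_a = [[0.8, 0.1, 0.1], [0.1, 0.8, 0.1], [0.1, 0.1, 0.8],
           [0.7, 0.2, 0.1], [0.2, 0.7, 0.1], [0.1, 0.2, 0.7]]
probs_b = [[0.7, 0.2, 0.1], [0.2, 0.6, 0.2], [0.1, 0.2, 0.7],
           [0.6, 0.3, 0.1], [0.1, 0.8, 0.1], [0.2, 0.1, 0.7]]
y_true = [0, 1, 2, 0, 1, 2]

X_meta = stack_features(probs_a, probs_b)
W, b = train_meta(X_meta, y_true)
y_pred = predict_meta(W, b, X_meta)
print("Macro F1 on the toy data:", macro_f1(y_true, y_pred))
```

In a stacking setup the meta-features would normally come from out-of-fold predictions of the base models, so that the meta-learner is not fitted on outputs the base models produced for their own training samples; the toy example skips that step for brevity.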

Paper Type: Long

Research Area: Low-resource Methods for NLP

Research Area Keywords: AI, NLP and its applications, Machine Learning for NLP, Language Models, NLP for Social Good, Classification

Contribution Types: Model analysis & interpretability, NLP engineering experiment, Approaches to low-resource settings, Approaches to low-compute settings (efficiency), Data resources, Data analysis

Languages Studied: Bengali

Submission Number: 6317
