BanVATLLM and BanTSS: A Multimodal Framework and a Dataset for Detecting Toxic Speech in Bangla and Bangla-English Videos
Keywords: Multimodal, Large Language Model, Toxicity, Text, Audio, Video
TL;DR: This study addresses detecting toxic speech in Bangla and Bangla-English videos using multimodal data and deep learning techniques.
Abstract: The rise of video content on social media has led to the spread of toxic speech, necessitating effective moderation. This study addresses the detection of toxic speech in Bangla and Bangla-English videos using multimodal data and deep learning techniques. The BanTSS dataset, comprising 431 videos and 2,021 annotated utterances with high inter-annotator agreement (Fleiss' Kappa), supports this research. We propose BanVATLLM, a multimodal framework that integrates audio, video, and text data. Leveraging advanced models such as Whisper, MMS, VideoMAE, TimeSformer, and ChatGPT-3.5, BanVATLLM achieves strong performance in classifying toxicity, severity, and sentiment: 95.78% F1 and 95.72% accuracy for toxicity, 88.27% F1 and 88.55% accuracy for severity, and 84.85% F1 and 83.86% accuracy for sentiment, advancing detection in low-resource languages.
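The abstract describes a framework that integrates audio, video, and text signals. The paper's actual fusion mechanism is not specified here, so the following is only a hypothetical sketch of one common approach (late fusion by concatenating per-modality embeddings into a linear softmax head); the embedding dimensions and class count are illustrative, not taken from the paper.

```python
import math
import random

random.seed(0)


def fuse_and_classify(audio_emb, video_emb, text_emb, weights, bias):
    """Toy late-fusion classifier: concatenate modality embeddings,
    apply a linear layer, and return softmax class probabilities."""
    fused = audio_emb + video_emb + text_emb  # list concatenation = feature concat
    logits = [
        sum(w * x for w, x in zip(row, fused)) + b
        for row, b in zip(weights, bias)
    ]
    m = max(logits)  # subtract max for numerical stability
    exp = [math.exp(v - m) for v in logits]
    total = sum(exp)
    return [e / total for e in exp]


# Illustrative dimensions (NOT from the paper): audio, video, text features
# and a binary toxic / non-toxic head.
AUDIO_DIM, VIDEO_DIM, TEXT_DIM, N_CLASSES = 8, 8, 8, 2
FUSED_DIM = AUDIO_DIM + VIDEO_DIM + TEXT_DIM

audio = [random.gauss(0, 1) for _ in range(AUDIO_DIM)]
video = [random.gauss(0, 1) for _ in range(VIDEO_DIM)]
text = [random.gauss(0, 1) for _ in range(TEXT_DIM)]
W = [[random.gauss(0, 0.1) for _ in range(FUSED_DIM)] for _ in range(N_CLASSES)]
b = [0.0] * N_CLASSES

probs = fuse_and_classify(audio, video, text, W, b)
print(probs)
```

In practice each embedding would come from the corresponding pretrained encoder (e.g. Whisper for audio, VideoMAE or TimeSformer for video), and the head would be trained end-to-end; this toy version only shows the data flow.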
Submission Number: 13