Multilingual Abusive Comment Detection at Scale for Indic Languages

Vikram Gupta; Sumegh Roychowdhury; Mithun Das; Somnath Banerjee; Punyajoy Saha; Binny Mathew; Hastagiri Prakash Vanchinathan; Animesh Mukherjee

Published: 17 Sept 2022, Last Modified: 23 May 2023, NeurIPS 2022 Datasets and Benchmarks, Readers: Everyone

Keywords: Abusive Text Detection, Indic Languages, Social Media

Abstract: Social media platforms were conceived to act as online 'town squares' where people could get together, share information and communicate with each other peacefully. However, harmful content posted by bad actors constantly plagues these platforms, slowly converting them into 'mosh pits' where the bad actors take the liberty to extensively abuse various marginalised groups. Accurate and timely detection of abusive content on social media platforms is therefore very important for facilitating safe interactions between users. However, due to the small scale and sparse linguistic coverage of Indic abusive speech datasets, development of such algorithms for Indic social media users (one-sixth of the global population) is severely impeded. To facilitate and encourage research in this important direction, we contribute for the first time MACD, a large-scale (150K), human-annotated, multilingual (5 languages), balanced (49% abusive content) and diverse (70K users) abuse detection dataset of user comments, sourced from a popular social media platform, ShareChat. We also release AbuseXLMR, an abusive content detection model pretrained on a large number of social media comments in 15+ Indic languages, which outperforms XLM-R and MuRIL on multiple Indic datasets. Along with the annotations, we also release the mapping between comment, post and user IDs to facilitate modelling the relationship between them. We share competitive monolingual, cross-lingual and few-shot baselines so that MACD can be used as a dataset benchmark for future research.

Author Statement: Yes

Supplementary Material: pdf

License: The code and dataset are available for research purposes only, and any commercial usage is strictly prohibited. The dataset MACD and the model AbuseXLMR are distributed under the CC BY-NC-SA license. CC BY-NC-SA allows reusers to distribute, remix, adapt, and build upon the material in any medium or format for noncommercial purposes only, and only so long as attribution is given to the creator. If you remix, adapt, or build upon the material, you must license the modified material under identical terms.

URL: https://github.com/ShareChatAI/MACD

Contribution Process Agreement: Yes

In Person Attendance: No
