[MASK]ED - Language Modeling for Explainable Classification and Disentangling of Socially Unacceptable Discourse.
Abstract: Analyzing Socially Unacceptable Discourse (SUD) online is a critical challenge for regulators and platforms amidst growing concerns over harmful content. While Pre-trained Masked Language Models (PMLMs) have proven effective for many NLP tasks, their performance often degrades in multi-label SUD classification due to overlapping linguistic cues across categories. In this work, we propose an artifact-guided pre-training strategy that injects statistically salient linguistic features, referred to as artifacts, into the masked language modeling objective. These context-sensitive tokens drive an importance-weighted masking scheme during pre-training, improving generalization across discourse types. We further use the artifact signals to inform a lightweight dataset curation procedure that flags noisy or ambiguous instances, supporting targeted relabeling and filtering and enabling more explainable and consistent annotation with minimal changes to the original data. Our approach yields consistent improvements across 10 datasets widely used in SUD classification benchmarks. Disclaimer: This article contains some extracts of unacceptable and upsetting language.
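To make the importance-weighted masking idea concrete, here is a minimal sketch. This is an illustrative reconstruction, not the authors' released code: the log-odds-ratio saliency heuristic, the `mask_rate` default, and all function names are assumptions standing in for whatever statistical saliency measure the paper actually uses.

```python
# Sketch of artifact-guided, importance-weighted masking for MLM pre-training.
# All names and the saliency heuristic below are illustrative assumptions.
import math
import random
from collections import Counter

def artifact_scores(class_corpora):
    """Score each token by how strongly it is associated with one class,
    using a simple smoothed log-odds-ratio heuristic (an assumption, not
    necessarily the paper's saliency measure)."""
    per_class = {c: Counter(tok for doc in docs for tok in doc)
                 for c, docs in class_corpora.items()}
    totals = Counter()
    for counts in per_class.values():
        totals.update(counts)
    n = sum(totals.values())
    vocab = len(totals)
    scores = {}
    for tok, freq in totals.items():
        best = 0.0
        for counts in per_class.values():
            in_total = sum(counts.values())
            p_in = (counts[tok] + 1) / (in_total + vocab)        # in-class rate
            p_out = (freq - counts[tok] + 1) / (n - in_total + vocab)  # out-of-class rate
            best = max(best, abs(math.log(p_in / p_out)))
        scores[tok] = best
    return scores

def importance_weighted_mask(tokens, scores, mask_rate=0.15):
    """Select ~mask_rate of the positions to mask, sampling positions in
    proportion to artifact saliency instead of uniformly at random."""
    weights = [scores.get(t, 0.0) + 1e-3 for t in tokens]  # floor keeps every token maskable
    k = max(1, round(mask_rate * len(tokens)))
    masked = set()
    for _ in range(k):
        # weighted sampling of positions without replacement
        choices = [i for i in range(len(tokens)) if i not in masked]
        w = [weights[i] for i in choices]
        masked.add(random.choices(choices, weights=w, k=1)[0])
    return [("[MASK]" if i in masked else t) for i, t in enumerate(tokens)]

if __name__ == "__main__":
    # Toy corpora keyed by discourse label; real inputs would be tokenized posts.
    corpora = {
        "offensive": [["you", "people", "are", "awful"], ["get", "out"]],
        "neutral": [["you", "are", "welcome", "here"], ["come", "in"]],
    }
    scores = artifact_scores(corpora)
    print(importance_weighted_mask(["you", "people", "are", "awful"], scores, mask_rate=0.3))
```

Under this sketch, class-discriminative tokens are masked more often than uniform 15% masking would mask them, so the model must predict exactly the artifacts that otherwise act as shortcuts.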
Paper Type: Long
Research Area: Computational Social Science and Cultural Analytics
Research Area Keywords: hate speech detection, pre-training, bias/toxicity, human-AI interaction/cooperation, human-in-the-loop, data shortcuts/artifacts, topic modeling
Contribution Types: Model analysis & interpretability
Languages Studied: English
Submission Number: 1817