Robustness Evaluation of Hate Speech Detection Models Under Structured Adversarial Obfuscation

ACL ARR 2026 May Submission 14553 Authors

26 May 2026 (modified: 11 Jun 2026) | ACL ARR 2026 May Submission | Readers: Everyone | CC BY 4.0
Keywords: Hate Speech Detection, Adversarial NLP, Text Obfuscation, Robustness Evaluation, Transformer Models, Content Moderation, Adversarial Attacks, Benchmark Dataset
Abstract: Hate speech classifiers are vulnerable to adversarial text obfuscation, yet existing robustness benchmarks typically evaluate models against either random noise or isolated perturbation techniques, and so fail to capture the structured, multi-strategy evasion behaviour observed in practice. We present a systematic benchmark that evaluates four transformer-based detectors against six obfuscation techniques under three attack regimes, spanning \textit{realistic}, \textit{uniform}, and \textit{adversarial} distributions. Going beyond prior work, we also model realistic social media language by obfuscating neutral (non-toxic) tokens alongside toxic ones, and we introduce \textit{obfuscation intensity} as a dedicated evaluation axis. Our results show substantial robustness degradation across all models: F1 falls by up to 0.356, and performance deteriorates non-linearly as perturbation density increases. Obfuscation-aware fine-tuning recovers up to 0.326 F1, demonstrating that robustness can be substantially improved without architectural modification.
Paper Type: Long
Research Area: Ethics, Bias, and Fairness
Research Area Keywords: hate speech detection, adversarial robustness, text obfuscation, transformer models, content moderation, robustness evaluation, abusive language detection
Contribution Types: NLP engineering experiment, Data resources, Data analysis
Languages Studied: English
EMNLP 2026 AI Reviewing Experiment: no
Submission Number: 14553