Training-Time Explainability for Multilingual Hate Speech Detection: Aligning Model Reasoning with Human Rationales

Published: 24 Nov 2025, Last Modified: 24 Nov 2025
Venue: 5th Muslims in ML Workshop, co-located with NeurIPS 2025
License: CC BY 4.0
Keywords: Explainable AI, Content Moderation, Human-aligned Explanations, XAI Regularization, Hate Speech Detection
TL;DR: We present a multilingual training-time XAI framework that aligns model reasoning with human rationales, improving accuracy and explanation quality for detecting implicit anti-Muslim hate in English and Hinglish.
Abstract: Online hate against Muslim communities often appears in culturally coded, multilingual forms that evade conventional AI moderation. Such systems, though accurate, remain opaque and risk bias, over-censorship, or under-moderation, particularly when detached from sociocultural context. We propose a training-time explainability framework that aligns model reasoning with human-annotated rationales, improving both classification performance and interpretability. Our approach is evaluated on HateXplain (English) and BullySent (Hinglish), reflecting the prevalence of anti-Muslim hate across both languages. Using LIME, Integrated Gradients, Gradient × Input, and attention, we assess accuracy, explanation quality, and cross-method agreement. Results show that gradient- and attention-based regularization improve F-scores, enhance plausibility and faithfulness, and capture culturally specific cues for detecting implicit anti-Muslim hate, offering a path toward multilingual, culturally aware content moderation.
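As a concrete illustration of the training-time alignment idea, the sketch below augments a standard classification loss with a term that penalizes divergence between the model's token-level attention and normalized human rationale annotations. This is a minimal sketch of one common form of attention regularization, not the paper's exact objective; the function name, tensor shapes, and the weight `lam` are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def rationale_alignment_loss(logits, labels, attn_weights, rationale_mask, lam=1.0):
    """Hypothetical training-time objective: cross-entropy plus a KL term
    that pulls model attention toward human-annotated rationale tokens.

    logits:         (batch, num_classes) classifier outputs
    labels:         (batch,) gold class labels
    attn_weights:   (batch, seq_len) attention distribution over input tokens
    rationale_mask: (batch, seq_len) binary human rationale annotations
    lam:            weight of the alignment term (assumed hyperparameter)
    """
    # Standard classification loss.
    ce = F.cross_entropy(logits, labels)

    # Only examples with rationale annotations contribute to alignment
    # (non-hateful posts typically have no marked rationale spans).
    has_rationale = rationale_mask.sum(dim=-1) > 0
    if not has_rationale.any():
        return ce

    attn = attn_weights[has_rationale].clamp(min=1e-8)
    mask = rationale_mask[has_rationale].float()
    # Normalize the binary mask into a target distribution over tokens.
    target = mask / mask.sum(dim=-1, keepdim=True)

    # KL divergence between model attention and the human rationale
    # distribution; F.kl_div expects log-probabilities as input.
    align = F.kl_div(attn.log(), target, reduction="batchmean")
    return ce + lam * align
```

The same scaffold extends to gradient-based alignment by substituting a normalized Gradient × Input attribution for `attn_weights`, at the cost of a second backward pass per step.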
Track: Track 1: ML on Islamic Content / ML for Muslim Communities
Submission Number: 10