Exploring the Plausibility of Hate and Counter Speech Detectors with Explainable AI

Published: 01 Jan 2024, Last Modified: 21 Jul 2025, CBMI 2024, CC BY-SA 4.0
Abstract: In this paper, we investigate the explainability of transformer models and their plausibility for hate speech and counter speech detection. We compare representatives of four explainability approaches, namely gradient-based, perturbation-based, attention-based, and prototype-based approaches, and analyze them quantitatively with an ablation study and qualitatively in a user study. Results show that perturbation-based explainability performs best, followed by gradient-based and attention-based explainability. Prototype-based experiments did not yield useful results. Overall, we observe that explainability strongly supports users in better understanding the model predictions.
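To illustrate the best-performing family of methods from the abstract, below is a minimal sketch of perturbation-based (occlusion) token attribution for a transformer text classifier: each word is masked in turn, and its attribution is the drop in the predicted class probability. This is not the paper's actual implementation; the model name, the word-level masking scheme, and the helper functions are illustrative assumptions.

```python
# Hedged sketch of perturbation-based (occlusion) explainability for a
# transformer classifier. Model choice and masking granularity are
# assumptions, not the setup used in the paper.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL = "cardiffnlp/twitter-roberta-base-hate"  # hypothetical stand-in model
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL)
model.eval()

def predict_prob(text: str, target_class: int) -> float:
    """Probability of `target_class` for a single input text."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    return torch.softmax(logits, dim=-1)[0, target_class].item()

def occlusion_attributions(text: str, target_class: int):
    """Score each word by how much masking it lowers the class probability."""
    words = text.split()
    base = predict_prob(text, target_class)
    scores = []
    for i in range(len(words)):
        perturbed = " ".join(words[:i] + [tokenizer.mask_token] + words[i + 1:])
        scores.append((words[i], base - predict_prob(perturbed, target_class)))
    return scores

# Words with high positive scores are the ones the model relied on most.
for word, score in occlusion_attributions("example input text here", target_class=1):
    print(f"{word:>15s}  {score:+.4f}")
```

One design note: occlusion needs one forward pass per word, so it is slower than a single gradient-based pass, but it probes the model's actual behavior under input changes rather than a local linearization, which is consistent with the abstract's finding that perturbation-based explanations were judged most plausible.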