Human-Aligned Faithfulness in Toxicity Explanations of LLMs

ACL ARR 2025 May Submission 7879 Authors

20 May 2025 (modified: 03 Jul 2025) · ACL ARR 2025 May Submission · CC BY 4.0
Abstract: The discourse around toxicity and LLMs in NLP largely revolves around detection tasks. In this work, we shift this focus to understanding models' reasoning process about toxicity in order to enhance their trustworthiness for downstream tasks. Despite extensive research on explainability, existing methods cannot be straightforwardly adopted to evaluate free-form toxicity explanations due to various limitations. To address these, we propose a novel, theoretically grounded dimension, Human-Aligned Faithfulness (\haf), which evaluates how closely LLMs' free-form toxicity explanations reflect those of an ideal, rational human agent. We further develop a suite of metrics based on uncertainty quantification that evaluate the \haf of toxicity explanations without human involvement and highlight how "non-ideal" the explanations are. We measure the \haf of three Llama models (of size up to 70B) and an 8B Ministral model on five diverse datasets. Our extensive experiments show that while LLMs generate plausible explanations at first, their reasoning about toxicity breaks down when prompted about nuanced relations between individual reasons and their toxicity stance, resulting in inconsistent and nonsensical responses. Finally, we will open-source the largest toxicity reasoning dataset to date, containing LLM-generated explanations. Our code is at: $\href{https://anonymous.4open.science/r/safte-7AE0/}{https://anonymous.4open.science/r/safte-7AE0/}$.
Paper Type: Long
Research Area: Interpretability and Analysis of Models for NLP
Research Area Keywords: free-text/natural language explanations, explanation faithfulness, probing, human-centered evaluation
Contribution Types: Model analysis & interpretability, Data resources
Languages Studied: English
Submission Number: 7879