Keywords: language model detoxification, causal intervention, interpretability, safety and alignment, inference-time intervention, toxicity mitigation
Abstract: Large language models (LLMs) frequently generate toxic content, posing significant risks for safe deployment. Current mitigation strategies often degrade generation quality or require costly human annotation. We propose CausalDetox, a framework that identifies and intervenes on the specific attention heads causally responsible for toxic generation. Using the Probability of Necessity and Sufficiency (PNS), we isolate a minimal set of heads that are necessary and sufficient for toxicity. We act on these heads via two complementary strategies: (1) Local Inference-Time Intervention, which constructs dynamic, input-specific steering vectors for context-aware detoxification, and (2) PNS-Guided Fine-Tuning, which permanently unlearns toxic representations. We also introduce ParaTox, a novel benchmark of aligned toxic/non-toxic sentence pairs that enables controlled counterfactual evaluation. Experiments on ToxiGen, ImplicitHate, and ParaDetox show that CausalDetox achieves up to 5.34\% greater toxicity reduction than baselines while preserving linguistic fluency, and offers a $7\times$ speedup in head selection.
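For reference, the PNS invoked in the abstract is Pearl's standard quantity; how CausalDetox instantiates the treatment $x$ (e.g., ablating a candidate head) and the outcome $y$ (a toxic generation) is our reading of the abstract, not something stated here. Writing $Y_x$ for the outcome under intervention $x$, the definition and the Tian–Pearl bounds are:
$$\mathrm{PNS} = P\big(Y_{x}=1,\; Y_{x'}=0\big),$$
$$\max\{0,\; P(y_x) - P(y_{x'})\} \;\le\; \mathrm{PNS} \;\le\; \min\{P(y_x),\; P(y'_{x'})\}.$$
A head with high PNS is thus one whose active state is both necessary and sufficient for the toxic outcome, which is presumably what makes it a target for intervention.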
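Below is a minimal sketch of what the "dynamic, input-specific steering vector" step of Local Inference-Time Intervention might look like; it is not the paper's released code. The function name, the scale `alpha`, and the use of per-head linear-probe directions are all assumptions. The input-specific part is modeled by scaling the shift with each token's projection onto the head's toxicity direction, so only the toxic component actually present in the current input is removed.

```python
import torch

def steer_heads(head_acts, toxic_heads, probe_dirs, alpha=1.0):
    """Hypothetical sketch of local inference-time head steering.

    head_acts:   (batch, n_heads, seq, head_dim) attention-head outputs
    toxic_heads: indices of heads selected (e.g., by a PNS-style score)
    probe_dirs:  (n_heads, head_dim) toxicity directions per head
    """
    out = head_acts.clone()
    for h in toxic_heads:
        d = probe_dirs[h] / probe_dirs[h].norm()     # unit toxicity direction
        proj = (out[:, h] @ d).unsqueeze(-1)         # (batch, seq, 1) projection
        out[:, h] = out[:, h] - alpha * proj * d     # input-specific shift
    return out
```

In practice this would run inside a forward hook on each selected attention layer before the output projection; a fixed (input-independent) steering vector would instead subtract `alpha * d` uniformly, which is the contrast the abstract's "dynamic" phrasing suggests.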
Paper Type: Long
Research Area: Safety and Alignment in LLMs
Research Area Keywords: Interpretability and Analysis of Models for NLP, Language Modeling, Machine Learning for NLP, Dialogue and Interactive Systems
Contribution Types: Model analysis & interpretability
Languages Studied: English
Submission Number: 905