Keywords: language model detoxification, causal intervention, interpretability, safety and alignment, inference-time intervention, toxicity mitigation
Abstract: Large language models (LLMs) frequently generate toxic content, posing significant risks for safe deployment. Current mitigation strategies often degrade generation quality or require costly human annotation. We propose CausalDetox, a framework that identifies and intervenes on the specific attention heads causally responsible for toxic generation. Using the Probability of Necessity and Sufficiency (PNS), we isolate a minimal set of heads that are necessary and sufficient for toxicity. We act on these heads via two complementary strategies: (1) Local Inference-Time Intervention, which constructs dynamic, input-specific steering vectors for context-aware detoxification, and (2) PNS-Guided Fine-Tuning, which permanently unlearns toxic representations. We also introduce ParaTox, a novel benchmark of aligned toxic/non-toxic sentence pairs that enables controlled counterfactual evaluation. Experiments on ToxiGen, ImplicitHate, and ParaDetox show that CausalDetox achieves up to 5.34\% greater toxicity reduction than baselines while preserving linguistic fluency, and offers a $7\times$ speedup in head selection.
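For reference, the PNS invoked in the abstract is Pearl's standard quantity; how CausalDetox instantiates the treatment $x$ (e.g., ablating a candidate head) and the outcome $y$ (a toxic generation) is our reading of the abstract, not something stated here. Writing $Y_x$ for the outcome under intervention $x$, the definition and the Tian–Pearl bounds are:
$$\mathrm{PNS} = P\big(Y_{x}=1,\; Y_{x'}=0\big),$$
$$\max\{0,\; P(y_x) - P(y_{x'})\} \;\le\; \mathrm{PNS} \;\le\; \min\{P(y_x),\; P(y'_{x'})\}.$$
A head with high PNS is thus one whose active state is both necessary and sufficient for the toxic outcome, which is presumably what makes it a target for intervention.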
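Below is a minimal sketch of what the "dynamic, input-specific steering vector" step of Local Inference-Time Intervention might look like; it is not the paper's released code. The function name, the scale `alpha`, and the use of per-head linear-probe directions are all assumptions. The input-specific part is modeled by scaling the shift with each token's projection onto the head's toxicity direction, so only the toxic component actually present in the current input is removed.

```python
import torch

def steer_heads(head_acts, toxic_heads, probe_dirs, alpha=1.0):
    """Hypothetical sketch of local inference-time head steering.

    head_acts:   (batch, n_heads, seq, head_dim) attention-head outputs
    toxic_heads: indices of heads selected (e.g., by a PNS-style score)
    probe_dirs:  (n_heads, head_dim) toxicity directions per head
    """
    out = head_acts.clone()
    for h in toxic_heads:
        d = probe_dirs[h] / probe_dirs[h].norm()     # unit toxicity direction
        proj = (out[:, h] @ d).unsqueeze(-1)         # (batch, seq, 1) projection
        out[:, h] = out[:, h] - alpha * proj * d     # input-specific shift
    return out
```

In practice this would run inside a forward hook on each selected attention layer before the output projection; a fixed (input-independent) steering vector would instead subtract `alpha * d` uniformly, which is the contrast the abstract's "dynamic" phrasing suggests.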
Paper Type: Long
Research Area: Safety and Alignment in LLMs
Research Area Keywords: Interpretability and Analysis of Models for NLP, Language Modeling, Machine Learning for NLP, Dialogue and Interactive Systems
Contribution Types: Model analysis & interpretability
Languages Studied: English
Submission Number: 905