Abstract: Large language models (LLMs) remain vulnerable to generating toxic content, which hinders their safe and responsible deployment in real-world settings. To address this, we introduce CausalDetox, a novel detoxification framework that identifies the attention heads within an LLM that causally drive toxic generation and fine-tunes them to enable targeted intervention. Our approach uses the probability of necessity and sufficiency (PNS), a causally grounded criterion, to select the heads most responsible for encoding toxicity. We then fine-tune those heads to further amplify their causal contribution to toxicity, and at inference time we apply targeted interventions to these heads to steer model outputs toward non-toxic generations. We evaluate our method on the ToxiGen and ImplicitHate datasets and introduce ParaTox, a new benchmark of paraphrased toxic and non-toxic prompts derived from Vicuna-13B, which enables controlled, fine-grained evaluation of detoxification methods. Empirical results show that our approach significantly reduces toxicity while maintaining linguistic fluency, providing a controllable and causally motivated path toward safer language generation.
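For reference, the probability of necessity and sufficiency invoked in the abstract is a standard counterfactual quantity from Pearl's causal framework. A minimal sketch of how it could score a candidate head, assuming X = 1 denotes the head being active, X = 0 its ablation, and Y = 1 a toxic generation (this notation is illustrative; the abstract does not specify the paper's exact head-level estimator):

\[
\mathrm{PNS} \;=\; P\!\left(Y_{X=1} = 1,\; Y_{X=0} = 0\right)
\]

With interventional data alone, PNS is only partially identifiable and obeys the standard Tian-Pearl bounds:

\[
\max\!\bigl[0,\; P(Y{=}1 \mid \mathrm{do}(X{=}1)) - P(Y{=}1 \mid \mathrm{do}(X{=}0))\bigr]
\;\le\; \mathrm{PNS} \;\le\;
\min\!\bigl[P(Y{=}1 \mid \mathrm{do}(X{=}1)),\; P(Y{=}0 \mid \mathrm{do}(X{=}0))\bigr]
\]

Heads with a high PNS (or a high lower bound on it) are those whose activation is both necessary and sufficient for toxic output, making them natural targets for the fine-tuning and inference-time intervention described above.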
Paper Type: Long
Research Area: Ethics, Bias, and Fairness
Research Area Keywords: detoxification, causal representation learning, targeted model editing
Contribution Types: Model analysis & interpretability
Languages Studied: English
Keywords: causal representation learning, toxicity detection, large language models, model interpretability, model detoxification, controllable generation
Submission Number: 2477