Abstract: Large language models (LLMs) remain vulnerable to generating toxic content, which hinders their safe and responsible deployment in real-world settings. To address this, we introduce CausalDetox, a novel detoxification framework that identifies the attention heads within an LLM that causally drive toxic generation and fine-tunes them to enable targeted intervention. Our approach uses the probability of necessity and sufficiency (PNS), a causally grounded criterion, to select the heads most responsible for encoding toxicity. We then fine-tune those heads to further amplify their causal contribution to toxicity, and at inference time we apply targeted interventions to these heads to steer model outputs toward non-toxic generations. We evaluate our method on the ToxiGen and ImplicitHate datasets and introduce ParaTox, a new benchmark of paraphrased toxic and non-toxic prompts derived from Vicuna-13B, which enables controlled, fine-grained evaluation of detoxification methods. Empirical results show that our approach significantly reduces toxicity while maintaining linguistic fluency, providing a controllable and causally motivated path toward safer language generation.
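For reference, the probability of necessity and sufficiency invoked in the abstract is a standard counterfactual quantity from Pearl's causal framework. A minimal sketch of how it could score a candidate head, assuming X = 1 denotes the head being active, X = 0 its ablation, and Y = 1 a toxic generation (this notation is illustrative; the abstract does not specify the paper's exact head-level estimator):

\[
\mathrm{PNS} \;=\; P\!\left(Y_{X=1} = 1,\; Y_{X=0} = 0\right)
\]

With interventional data alone, PNS is only partially identifiable and obeys the standard Tian-Pearl bounds:

\[
\max\!\bigl[0,\; P(Y{=}1 \mid \mathrm{do}(X{=}1)) - P(Y{=}1 \mid \mathrm{do}(X{=}0))\bigr]
\;\le\; \mathrm{PNS} \;\le\;
\min\!\bigl[P(Y{=}1 \mid \mathrm{do}(X{=}1)),\; P(Y{=}0 \mid \mathrm{do}(X{=}0))\bigr]
\]

Heads with a high PNS (or a high lower bound on it) are those whose activation is both necessary and sufficient for toxic output, making them natural targets for the fine-tuning and inference-time intervention described above.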
Paper Type: Long
Research Area: Ethics, Bias, and Fairness
Research Area Keywords: detoxification, causal representation learning, targeted model editing
Contribution Types: Model analysis & interpretability
Languages Studied: English
Keywords: causal representation learning, toxicity detection, large language models, model interpretability, model detoxification, controllable generation
Submission Number: 2477