How Does DPO Reduce Toxicity? A Mechanistic Neuron-Level Analysis

ACL ARR 2025 May Submission4039 Authors

19 May 2025 (modified: 03 Jul 2025), ACL ARR 2025 May Submission, CC BY 4.0

Abstract: Safety fine-tuning algorithms reduce harmful outputs in language models, yet their mechanisms remain under-explored. Direct Preference Optimization (DPO) is a popular choice of algorithm, but prior explanations, which attribute its effects solely to dampened toxic neurons in the MLP layers, are incomplete. In this study, we analyse four language models (Llama-3.1-8B, Gemma-2-2B, Mistral-7B, GPT-2-Medium) and show that toxic neurons account for only 2.5% to 24% of DPO's effects across models. Instead, DPO induces distributed activation shifts across all MLP neurons to create a net toxicity reduction. We attribute this reduction to four neuron groups (two aligned with reducing toxicity and two promoting anti-toxicity) whose combined effects replicate DPO across models. To further validate this understanding, we develop an activation editing method that mimics DPO through distributed shifts along a toxicity representation. This method outperforms DPO in reducing toxicity while preserving perplexity, without requiring any weight updates. Our work provides a mechanistic understanding of DPO and introduces an efficient, tuning-free alternative for safety fine-tuning.
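The idea of a distributed activation shift along a toxicity representation can be illustrated with a toy sketch. This is not the authors' method, whose details are not given in the abstract. It assumes a single MLP layer whose output is the activation-weighted sum of per-neuron value vectors, plus a hypothetical `probe` vector standing in for the toxicity representation. The names `toxicity_projection`, `shift_activations`, and `alpha` are illustrative choices, not names from the paper.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def toxicity_projection(activations, value_vectors, probe):
    """Projection of the MLP output, sum_i a_i * v_i, onto the probe direction."""
    return sum(a * dot(v, probe) for a, v in zip(activations, value_vectors))


def shift_activations(activations, value_vectors, probe, alpha=0.5):
    """Shift every neuron's activation against its own alignment with the probe.

    Assumption: each neuron i is moved by -alpha * (v_i . probe), so the net
    projection drops by alpha * sum_i (v_i . probe)^2, spread over all neurons
    rather than concentrated in a few "toxic" ones.
    """
    return [a - alpha * dot(v, probe) for a, v in zip(activations, value_vectors)]


# Toy data (hypothetical values, three neurons in a 2-d residual space).
probe = [1.0, 0.0]
value_vectors = [[0.9, 0.1], [-0.5, 0.5], [0.1, -0.8]]
activations = [1.0, 0.2, 0.7]

before = toxicity_projection(activations, value_vectors, probe)
edited = shift_activations(activations, value_vectors, probe)
after = toxicity_projection(edited, value_vectors, probe)
```

In this sketch every neuron moves a little, including neurons that write against the probe direction, whose activations increase. No weights are changed; only the activations are edited at inference time.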

Paper Type: Long

Research Area: Interpretability and Analysis of Models for NLP

Research Area Keywords: Interpretability and Analysis of Models for NLP; Language Modeling; Efficient/Low-Resource Methods for NLP; NLP Applications

Contribution Types: Model analysis & interpretability, Reproduction study, Approaches to low-resource settings, Approaches low compute settings-efficiency

Languages Studied: English

Submission Number: 4039
