Track: Technical
Keywords: Interpretability, Large language models, Safety fine-tuning, Toxicity reduction
TL;DR: This paper provides a neuron-level understanding of the DPO algorithm, refuting the claim that it reduces toxicity solely by dampening toxic neurons (which account for only 4.9% of the reduction) and showing that it instead works via cumulative effects across four neuron groups.
Abstract: Safety fine-tuning algorithms are widely used to reduce harmful outputs in language models. While studies show that these algorithms induce minimal changes to pre-trained model parameters, how such small parameter changes lead to harm reduction remains unclear. For the direct preference optimization (DPO) algorithm applied to toxicity reduction, the prevailing explanation claims that DPO reduces toxicity by dampening the activations of the most toxic MLP neurons. However, our activation patching experiments show that this explanation is incomplete. Projections onto a toxicity probe show that only 4.9% of the toxicity reduction comes from dampened toxic neurons. Instead, DPO reduces toxicity through distributed activation shifts across four neuron groups: two removing toxicity and two promoting anti-toxicity, cumulatively shifting MLP outputs away from toxicity. Neurons that do not promote toxic tokens still contribute to this reduction through their weakly aligned components. These distributed activation shifts, induced by DPO's minimal parameter changes, form a mask over the pre-trained toxic capabilities while being small enough to preserve the model's general language capabilities. Building on these insights, we propose an activation-patching technique applied to the identified neuron groups that outperforms DPO in reducing toxicity while maintaining general language capabilities.
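The decomposition described in the abstract, attributing toxicity reduction to individual MLP neurons by projecting their contributions onto a toxicity probe, can be illustrated with a minimal sketch. This is not the paper's implementation: the dimensions, the probe, the activations, and the top-100 cutoff below are all synthetic assumptions chosen only to show the shape of the computation.

```python
# Hypothetical sketch: attribute toxicity change to individual MLP neurons by
# projecting each neuron's output contribution onto a linear "toxicity probe" direction.
# All tensors are synthetic stand-ins; d_mlp, d_model, probe, and the cutoff are assumed.
import numpy as np

rng = np.random.default_rng(0)

d_mlp, d_model = 3072, 768                                    # GPT-2-small-like sizes (assumed)
W_out = rng.normal(size=(d_mlp, d_model)) / np.sqrt(d_model)  # MLP output projection weights
probe = rng.normal(size=d_model)
probe /= np.linalg.norm(probe)                                # unit toxicity direction (learned in practice)

# Neuron activations before and after safety fine-tuning (synthetic; DPO-style small shifts).
act_pre = rng.normal(size=d_mlp)
act_post = act_pre + rng.normal(scale=0.05, size=d_mlp)

# Each neuron's contribution to the residual stream, projected onto the probe direction.
tox_pre = act_pre * (W_out @ probe)
tox_post = act_post * (W_out @ probe)

# Per-neuron change in projected toxicity; summing recovers the total shift,
# so the overall reduction can be decomposed across neuron groups.
delta = tox_post - tox_pre
total_shift = delta.sum()
top_toxic = np.argsort(tox_pre)[-100:]                        # "most toxic" neurons (arbitrary cutoff)
print(f"fraction of shift from top-100 toxic neurons: {delta[top_toxic].sum() / total_shift:.3f}")
```

Under this kind of decomposition, the share of the shift explained by the most toxic neurons can be compared against the remainder contributed by weakly aligned neurons, which is the comparison behind the 4.9% figure reported in the abstract.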
Submission Number: 60