Decoupling Noise and Toxic Parameters for Language Model Detoxification by Task Vector Merging

Published: 10 Jul 2024 · Last Modified: 26 Aug 2024 · COLM · CC BY 4.0
Research Area: Safety, Learning algorithms for LMs
Keywords: detoxification, task vector, weight merging, unlearning
TL;DR: We propose a more effective detoxification method that merges task vectors extracted from models fine-tuned on split subsets of a toxic dataset.
Abstract: The goal of detoxifying language models is to reduce the chance that pre-trained language models (PLMs) produce offensive or harmful output, ensuring their safer use. A recently proposed detoxification method subtracts a task vector, obtained as the difference between a model fine-tuned on a toxic dataset and the pre-trained model, from the PLM's weights. This approach is effective for detoxification but still suffers from degradation of the model's general ability. This study focuses on further mitigating that degradation while maintaining detoxification performance. To this end, we propose a method that detoxifies a PLM by fine-tuning multiple models on split subsets of the toxic dataset and merging the resulting negated task vectors. We conducted experiments on two toxic datasets (Civil Comments and Toxigen) with five PLMs (GPT2-small, GPT2-medium, GPT2-large, Phi-1.5, and Llama2-7b), demonstrating that our method consistently achieves a lower toxicity score while preventing degradation better than baseline methods. In particular, with GPT2-small on the Toxigen dataset, degradation was reduced by 38.9% compared with the existing task vector method while maintaining a similar toxicity score. In addition, we found that merging multiple detoxified models tends to increase the number of parameters that remain almost unchanged from the pre-trained model. We conjecture that merging multiple detoxified models implicitly achieves "decoupling noise and toxic parameters": accidental noise in parameter shifts unrelated to detoxification is averaged out, whereas the parameter shift associated with detoxification is preserved. We hope that the findings of this study will be applied not only to detoxification but also to other research domains that seek to suppress undesirable outputs of language models.
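The merging step described in the abstract can be illustrated with a minimal sketch. This is not the authors' released code: the function name, the `alpha` scaling coefficient, and the use of raw PyTorch state dicts are illustrative assumptions; the paper itself only specifies fine-tuning on split toxic subsets, computing task vectors against the pre-trained weights, and merging their negations.

```python
import torch

def detoxify_by_merged_task_vectors(pretrained_sd, finetuned_sds, alpha=1.0):
    """Sketch of detoxification via merged task vectors (hypothetical helper).

    pretrained_sd: state dict of the pre-trained LM.
    finetuned_sds: state dicts of models fine-tuned on split toxic subsets.
    alpha: illustrative scaling coefficient for the negated task vector.
    """
    merged = {}
    for name, theta_pre in pretrained_sd.items():
        # Task vector per split: fine-tuned weights minus pre-trained weights.
        task_vecs = [sd[name] - theta_pre for sd in finetuned_sds]
        # Averaging across splits tends to cancel split-specific noise while
        # keeping the shared, toxicity-related direction of the shift
        # ("decoupling noise and toxic parameters").
        avg_vec = torch.stack(task_vecs).mean(dim=0)
        # Subtract (i.e., negate) the merged vector to move the PLM away
        # from the toxic behavior.
        merged[name] = theta_pre - alpha * avg_vec
    return merged
```

Under this reading, the claimed benefit is that parameters whose shifts disagree across splits average toward zero and thus stay close to their pre-trained values, whereas shifts that agree across splits (those tied to toxicity) survive the merge.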
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the COLM Code of Ethics on https://colmweb.org/CoE.html
Author Guide: I certify that this submission complies with the submission instructions as described on https://colmweb.org/AuthorGuide.html
Submission Number: 561