Research Area: Safety, Learning algorithms for LMs
Keywords: detoxification, task vector, weight merging, unlearning
TL;DR: We propose a more effective detoxification method that merges task vectors extracted from models fine-tuned on split subsets of a toxic dataset.
Abstract: The goal of detoxifying language models is to reduce the chance that pre-trained language models (PLMs) produce offensive or harmful output, ensuring their safer use. A recently proposed detoxification method utilizes the task vector obtained by subtracting the weights of the pre-trained model from those of a model fine-tuned on a toxic dataset. This approach is effective for detoxification but still suffers from degradation. This study focuses on further mitigating that degradation while maintaining detoxification performance. To this end, we propose a method that detoxifies a PLM by fine-tuning multiple models on split subsets of a toxic dataset and merging the task vectors that are subtracted from the PLM.
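The following sketch is our illustrative reconstruction of such a scheme, not the authors' released code. It assumes PyTorch and Hugging Face Transformers weights, k models fine-tuned on disjoint splits of the toxic dataset, and simple uniform averaging of the task vectors before subtraction; the model names in the usage comment are placeholders.

```python
# Minimal sketch: detoxify a PLM by averaging task vectors from models
# fine-tuned on disjoint toxic-data splits and subtracting the merged vector.
import torch
from transformers import AutoModelForCausalLM

def merged_detox_state_dict(pretrained_name, finetuned_names, scale=1.0):
    """Return weights: pretrained - scale * mean_i(finetuned_i - pretrained)."""
    base = AutoModelForCausalLM.from_pretrained(pretrained_name).state_dict()
    detoxed = {k: v.clone() for k, v in base.items()}
    for name in finetuned_names:
        ft = AutoModelForCausalLM.from_pretrained(name).state_dict()
        for k, v in detoxed.items():
            if not torch.is_floating_point(v):  # skip integer/bool buffers
                continue
            # task vector for this split: ft[k] - base[k]; subtract its averaged share
            v -= scale * (ft[k] - base[k]) / len(finetuned_names)
    return detoxed

# Hypothetical usage with GPT2-small and three split-wise fine-tuned models:
# state = merged_detox_state_dict("gpt2", ["toxic-ft-split1", "toxic-ft-split2", "toxic-ft-split3"])
# model = AutoModelForCausalLM.from_pretrained("gpt2"); model.load_state_dict(state)
```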
We conducted experiments on two toxic datasets (Civil Comments and Toxigen) with five PLMs (GPT2-small, GPT2-medium, GPT2-large, Phi-1.5, and Llama2-7b), demonstrating that our method consistently achieves lower toxicity scores while causing less degradation than baseline methods.
In particular, with the GPT2-small model on the Toxigen dataset, degradation was reduced by 38.9% compared to the existing task vector method while maintaining a similar toxicity score.
In addition, we found that merging multiple detoxified models tends to increase the number of parameters that remain almost unchanged from the pre-trained model.
We hypothesize that merging multiple detoxified models implicitly achieves a "decoupling of noise and toxic parameters": accidental parameter shifts unrelated to detoxification cancel out through averaging, whereas parameter shifts associated with detoxification are preserved.
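A toy numerical check of this intuition (our own illustration, not an experiment from the paper): if each split's task vector is modeled as a shared detoxification direction plus independent noise, averaging k such vectors shrinks the noise by roughly 1/sqrt(k) while leaving the shared direction intact.

```python
# Toy check of the noise-averaging intuition (illustrative only).
import torch

torch.manual_seed(0)
dim, k = 10_000, 4
signal = torch.randn(dim)                                 # shared detoxification direction
vectors = [signal + torch.randn(dim) for _ in range(k)]   # per-split task vectors with noise
merged = torch.stack(vectors).mean(dim=0)

print((vectors[0] - signal).norm() / signal.norm())  # relative noise of one vector, ~1.0
print((merged - signal).norm() / signal.norm())      # ~0.5, i.e. noise shrinks as 1/sqrt(k)
```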
We hope that the findings of this study will be applied not only to detoxification but also to many other research domains that seek to suppress undesirable outputs of language models.
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the COLM Code of Ethics on https://colmweb.org/CoE.html
Author Guide: I certify that this submission complies with the submission instructions as described on https://colmweb.org/AuthorGuide.html
Submission Number: 561