Fixing What Fine-Tuning Breaks: A Simple and Efficient Method to Improve Safety Post Domain Adaptation
Keywords: fine-tuning, safety, vector steering, large language models, interpretability, robustness
TL;DR: Improving the domain-specific and general safety of models post fine-tuning via weight steering.
Abstract: Safety-aligned language models suffer a reduction in safety after fine-tuning, even on benign data. Prior work has proposed addressing this issue via further preference optimization of the fine-tuned models; however, that approach is computationally expensive and requires domain-specific preference-optimization data. In this paper, we aim to alleviate the degradation in the general safety of fine-tuned language models via a weight-steering methodology that is computationally inexpensive and does not require in-domain preference-optimization data. We further demonstrate that our methodology induces statistically insignificant changes to the model’s general coherence and false-rejection rates and retains the model’s domain-specific knowledge. Finally, we find that our method also increases the domain-specific safety of the language model without requiring domain-specific safety data.
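The abstract does not specify the exact steering procedure, but weight steering is commonly realized as arithmetic in parameter space: forming a direction between a safety-aligned checkpoint and a fine-tuned one, then moving the fine-tuned weights along that direction. The sketch below is a generic, hypothetical illustration of that idea, not necessarily the paper's method; the function name, the per-layer dictionary representation, and the scaling factor `alpha` are all assumptions.

```python
import numpy as np

def steer_weights(finetuned: dict, aligned: dict, alpha: float = 0.5) -> dict:
    """Hypothetical weight steering: move each fine-tuned parameter tensor
    a fraction alpha along the direction (aligned - finetuned).

    alpha = 0 keeps the fine-tuned weights unchanged;
    alpha = 1 fully restores the safety-aligned weights.
    """
    return {name: w + alpha * (aligned[name] - w) for name, w in finetuned.items()}

# Toy two-parameter "model" to show the arithmetic.
aligned = {"layer": np.array([1.0, 1.0])}
finetuned = {"layer": np.array([3.0, -1.0])}
steered = steer_weights(finetuned, aligned, alpha=0.5)
# steered["layer"] is the midpoint: array([2.0, 0.0])
```

Because this operates directly on checkpoints, it needs no gradient updates or preference data, which is consistent with the abstract's claim of computational inexpensiveness.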
Supplementary Material: zip
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 14200