Fixing What Fine-Tuning Breaks: A Simple and Efficient Method to Improve Safety Post Domain Adaptation
Keywords: fine-tuning, safety, vector steering, large language models, interpretability, robustness
TL;DR: Improving the domain-specific and general safety of models post fine-tuning via weight steering.
Abstract: Safety-aligned language models suffer a reduction in safety after fine-tuning, even on benign data. Prior work has proposed addressing this issue via further preference optimization of the fine-tuned models; however, that approach is computationally expensive and requires domain-specific preference-optimization data. In this paper, we aim to alleviate the degradation in the general safety of fine-tuned language models via a weight-steering methodology that is computationally inexpensive and does not require in-domain preference-optimization data. We further demonstrate that our methodology induces statistically insignificant changes to the model’s general coherence and false-rejection rates and retains the model’s domain-specific knowledge. Finally, we find that our method also increases the domain-specific safety of the language model without requiring domain-specific safety data.
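The abstract does not specify the exact steering procedure, but weight steering is commonly realized as arithmetic in parameter space: forming a direction between a safety-aligned checkpoint and a fine-tuned one, then moving the fine-tuned weights along that direction. The sketch below is a generic, hypothetical illustration of that idea, not necessarily the paper's method; the function name, the per-layer dictionary representation, and the scaling factor `alpha` are all assumptions.

```python
import numpy as np

def steer_weights(finetuned: dict, aligned: dict, alpha: float = 0.5) -> dict:
    """Hypothetical weight steering: move each fine-tuned parameter tensor
    a fraction alpha along the direction (aligned - finetuned).

    alpha = 0 keeps the fine-tuned weights unchanged;
    alpha = 1 fully restores the safety-aligned weights.
    """
    return {name: w + alpha * (aligned[name] - w) for name, w in finetuned.items()}

# Toy two-parameter "model" to show the arithmetic.
aligned = {"layer": np.array([1.0, 1.0])}
finetuned = {"layer": np.array([3.0, -1.0])}
steered = steer_weights(finetuned, aligned, alpha=0.5)
# steered["layer"] is the midpoint: array([2.0, 0.0])
```

Because this operates directly on checkpoints, it needs no gradient updates or preference data, which is consistent with the abstract's claim of computational inexpensiveness.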
Supplementary Material: zip
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 14200