Abstract: Large language models (LLMs) are vulnerable to safety risks during fine-tuning, where small amounts of malicious or even seemingly harmless data can compromise their safeguards. Among the many mitigation strategies proposed, Safe LoRA stands out for discretizing and projecting LoRA weights into a safety-aligned subspace, but it overlooks layer continuity: its discrete projections disrupt the continuity of learned features across layers and degrade model performance. In this paper, building on the concept of the alignment direction, defined by the weight difference between aligned and unaligned models, we observe that perturbations along this direction preserve model safety. In contrast, perturbations orthogonal to it are strongly correlated with harmful directions and rapidly degrade safety, framing the parameter space as a “narrow safety basin”.
Based on this insight, we propose a safety fine-tuning methodology called AsFT (Anchoring Safety in Fine-Tuning), which integrates a regularization term into the training objective. This term uses the alignment direction as an anchor to suppress updates along harmful directions, ensuring that fine-tuning remains constrained within the “narrow safety basin”. Extensive experiments on multiple datasets show that AsFT outperforms Safe LoRA, reducing harmful behavior by 7.60%, improving model performance by 3.44%, and maintaining robust performance across various experimental settings. Our code is available at \url{https://anonymous.4open.science/r/Anonymous-40D9}.
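To make the anchoring idea concrete, below is a minimal PyTorch sketch of such a regularizer: it projects a layer's LoRA update onto the subspace spanned by that layer's alignment direction and penalizes the orthogonal residual. The projector construction, function names, and the weighting scheme are illustrative assumptions, not the paper's exact formulation.

```python
import torch

def asft_regularizer(delta_w: torch.Tensor, align_dir: torch.Tensor) -> torch.Tensor:
    """Penalize the component of a LoRA update that leaves the alignment subspace.

    delta_w:   the layer's LoRA update B @ A                  (d_out x d_in)
    align_dir: alignment direction for this layer,
               W_aligned - W_unaligned                         (d_out x d_in)
    """
    # Projector onto the column space of the alignment direction
    # (hypothetical construction; the paper's projector may differ).
    proj = align_dir @ torch.linalg.pinv(align_dir)            # (d_out x d_out)
    # Residual: the part of the update orthogonal to the alignment subspace,
    # i.e. the component the regularizer suppresses.
    residual = delta_w - proj @ delta_w
    return residual.pow(2).sum()

# Illustrative use inside a fine-tuning step (lora_layers and lam are assumed):
# loss = task_loss + lam * sum(asft_regularizer(B @ A, d) for (B, A, d) in lora_layers)
```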
Paper Type: Long
Research Area: Language Modeling
Research Area Keywords: security and privacy, fine-tuning
Contribution Types: Model analysis & interpretability
Languages Studied: English
Submission Number: 267