Keywords: Large Language Models, Safety, Fine-tuning
Abstract: Large language models (LLMs) are vulnerable to safety risks during fine-tuning, where even small amounts of malicious or benign data can compromise safeguards. In this paper, building on the concept of the alignment direction---defined by the weight difference between aligned and unaligned models---we observe that perturbations along this direction preserve model safety. In contrast, perturbations orthogonal to this direction are strongly correlated with harmful updates and rapidly degrade safety, revealing the safe region of parameter space as a "narrow safety basin". Based on this insight, we propose **AsFT** (Anchoring Safety in Fine-Tuning), a data-free method that formulates safety-preserving fine-tuning as a constrained optimization problem. AsFT uses the alignment direction as an anchor and restricts parameter updates to the "narrow safety basin" through a tractable Lagrangian relaxation, thereby suppressing harmful updates while preserving task-relevant adaptation. Extensive experiments across multiple datasets and models demonstrate that AsFT reduces harmful behaviors by up to 7.60\%, improves task performance by 3.44\%, and consistently outperforms existing methods across diverse fine-tuning scenarios.
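The abstract does not include implementation details, so the following is only a minimal sketch of how the described constraint could look in PyTorch: the alignment direction is the (detached) difference between aligned and unaligned weights, and the component of each parameter update orthogonal to that direction is penalized with a fixed weight standing in for the Lagrangian term. All names here (`orthogonal_penalty`, `training_step`, `aligned_ref`, `unaligned_ref`, `lambda_reg`) are illustrative assumptions, not the authors' code.

```python
import torch

def orthogonal_penalty(delta_w: torch.Tensor,
                       w_aligned: torch.Tensor,
                       w_unaligned: torch.Tensor) -> torch.Tensor:
    """Squared norm of the update component orthogonal to the alignment
    direction, i.e. the difference between aligned and unaligned weights."""
    d = (w_aligned - w_unaligned).detach().flatten()
    d = d / (d.norm() + 1e-8)            # unit alignment direction
    dw = delta_w.flatten()
    parallel = (dw @ d) * d              # projection onto the direction
    return ((dw - parallel) ** 2).sum()  # penalize what leaves the direction


def training_step(model, aligned_ref, unaligned_ref, batch,
                  optimizer, task_loss_fn, lambda_reg=0.1):
    """One fine-tuning step: task loss plus a penalty on update components
    orthogonal to the alignment direction (fixed-weight relaxation)."""
    optimizer.zero_grad()
    loss = task_loss_fn(model, batch)
    for p, p_a, p_u in zip(model.parameters(),
                           aligned_ref.parameters(),
                           unaligned_ref.parameters()):
        if p.requires_grad:
            # Update relative to the aligned reference weights.
            loss = loss + lambda_reg * orthogonal_penalty(
                p - p_a.detach(), p_a, p_u)
    loss.backward()
    optimizer.step()
    return loss.item()
```

In practice such a penalty would more likely be applied per layer (or to LoRA updates), with the multiplier tuned or adapted as part of the relaxation rather than held fixed; this sketch only illustrates the overall shape of the constraint.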
Submission Number: 10