Keywords: Large Language Models, Safety, Fine-tuning
Abstract: Large language models (LLMs) are vulnerable to safety risks during fine-tuning, where even small amounts of malicious or benign data can compromise safeguards. In this paper, building on the concept of the alignment direction---defined by the weight difference between aligned and unaligned models---we observe that perturbations along this direction preserve model safety. In contrast, perturbations orthogonal to this direction are strongly correlated with harmful updates and rapidly degrade safety, revealing the safe region of parameter space as a "narrow safety basin". Based on this insight, we propose **AsFT** (Anchoring Safety in Fine-Tuning), a data-free method that formulates safety-preserving fine-tuning as a constrained optimization problem. AsFT uses the alignment direction as an anchor and restricts parameter updates to the "narrow safety basin" through a tractable Lagrangian relaxation, thereby suppressing harmful updates while preserving task-relevant adaptation. Extensive experiments across multiple datasets and models demonstrate that AsFT reduces harmful behaviors by up to 7.60\%, improves task performance by 3.44\%, and consistently outperforms existing methods across diverse fine-tuning scenarios.
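The abstract does not include implementation details, so the following is only a minimal sketch of how the described constraint could look in PyTorch: the alignment direction is the (detached) difference between aligned and unaligned weights, and the component of each parameter update orthogonal to that direction is penalized with a fixed weight standing in for the Lagrangian term. All names here (`orthogonal_penalty`, `training_step`, `aligned_ref`, `unaligned_ref`, `lambda_reg`) are illustrative assumptions, not the authors' code.

```python
import torch

def orthogonal_penalty(delta_w: torch.Tensor,
                       w_aligned: torch.Tensor,
                       w_unaligned: torch.Tensor) -> torch.Tensor:
    """Squared norm of the update component orthogonal to the alignment
    direction, i.e. the difference between aligned and unaligned weights."""
    d = (w_aligned - w_unaligned).detach().flatten()
    d = d / (d.norm() + 1e-8)            # unit alignment direction
    dw = delta_w.flatten()
    parallel = (dw @ d) * d              # projection onto the direction
    return ((dw - parallel) ** 2).sum()  # penalize what leaves the direction


def training_step(model, aligned_ref, unaligned_ref, batch,
                  optimizer, task_loss_fn, lambda_reg=0.1):
    """One fine-tuning step: task loss plus a penalty on update components
    orthogonal to the alignment direction (fixed-weight relaxation)."""
    optimizer.zero_grad()
    loss = task_loss_fn(model, batch)
    for p, p_a, p_u in zip(model.parameters(),
                           aligned_ref.parameters(),
                           unaligned_ref.parameters()):
        if p.requires_grad:
            # Update relative to the aligned reference weights.
            loss = loss + lambda_reg * orthogonal_penalty(
                p - p_a.detach(), p_a, p_u)
    loss.backward()
    optimizer.step()
    return loss.item()
```

In practice such a penalty would more likely be applied per layer (or to LoRA updates), with the multiplier tuned or adapted as part of the relaxation rather than held fixed; this sketch only illustrates the overall shape of the constraint.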
Submission Number: 10