Keywords: large language models, safety alignment, fine-tuning, parameter editing
TL;DR: We introduce Surgical Safety Repair, which identifies and updates harmful LoRA parameters to improve safety alignment in fine-tuned LLMs.
Abstract: Fine-tuning is a fundamental technique for adapting Large Language Models (LLMs) to specialized tasks, yet it can unexpectedly compromise a model's safety alignment even when the fine-tuning data is perceived as benign. However, many existing defenses are limited by their dependence on a pre-computed safety vector, which typically requires access to both the base model and a safety-aligned counterpart. Moreover, the safety alignment these methods achieve often degrades into simplistic refusal rather than nuanced, helpful responses.
In this paper, we introduce Surgical Safety Repair (SSR), a novel post-hoc framework designed to precisely correct harmful behaviors in fine-tuned models while maximally preserving their utility. SSR operates as an automated three-stage pipeline: it first leverages a diagnostic dataset to prompt the compromised model to reveal its safety flaws, constructing a model-specific corrective dataset. It then employs gradient-based attribution to localize a targeted set of LoRA parameters responsible for harmful outputs. Finally, it performs a parameter-isolated update on the corrective dataset, using a dual-objective loss to unlearn harmful responses and steer the model towards safe and constructive ones. Experiments on diverse models demonstrate that SSR reduces the harmfulness score to below 5\% while largely preserving the model's original capabilities, with minimal performance drop on downstream benchmarks such as GSM8K.
Furthermore, SSR guides the model to generate high-quality refusals, fostering a deeper and more nuanced safety alignment beyond mere response suppression.
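The sketch below illustrates how the gradient-based attribution and parameter-isolated update described in the abstract could be realized; it is a minimal, hypothetical outline, not the paper's implementation, and all names (attribute_lora_params, isolated_update, the loss callables, the top_k fraction) are illustrative assumptions.

```python
# Hypothetical sketch of attribution + parameter-isolated LoRA repair.
# Names and hyperparameters are illustrative, not taken from the paper.
import torch


def attribute_lora_params(model, harmful_batch, harm_loss_fn, top_k=0.01):
    """Score each LoRA parameter entry by |grad * weight| on harmful outputs
    and return boolean masks selecting the top fraction per LoRA tensor."""
    model.zero_grad()
    loss = harm_loss_fn(model, harmful_batch)  # loss of reproducing the harmful response
    loss.backward()
    masks = {}
    for name, p in model.named_parameters():
        if "lora" not in name or p.grad is None:
            continue
        score = (p.grad * p.data).abs().flatten()
        k = max(1, int(top_k * score.numel()))
        thresh = torch.topk(score, k).values.min()
        masks[name] = (score >= thresh).view_as(p)
    return masks


def isolated_update(model, masks, corrective_batch, dual_loss_fn, lr=1e-4, steps=50):
    """Update only the attributed LoRA entries with a dual-objective loss,
    e.g. NLL(safe response) - lambda * NLL(harmful response)."""
    params = [p for n, p in model.named_parameters() if n in masks]
    opt = torch.optim.AdamW(params, lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = dual_loss_fn(model, corrective_batch)
        loss.backward()
        # Zero gradients outside the attributed mask so only targeted entries move.
        for n, p in model.named_parameters():
            if n in masks and p.grad is not None:
                p.grad.mul_(masks[n].to(p.grad.dtype))
        opt.step()
```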
Supplementary Material: zip
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 17144