SafetyLock: Guarding LLMs against Fine-Tuning Risks with an Efficient Inference-Time Add-on

ICLR 2026 Conference Submission 16461 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: large language models, safety, fine-tuning, activation
TL;DR: Your fine-tuned model is back to its original safety standards faster than you can say "SafetyLock"!
Abstract: Fine-tuning large language models (LLMs) on additional datasets is often necessary to optimize them for specific downstream tasks. However, existing safety alignment measures, which restrict harmful behavior during inference, are insufficient to mitigate safety risks during fine-tuning. Alarmingly, fine-tuning with only 10 toxic sentences can significantly degrade a model’s safety. Moreover, considering the proliferation of fine-tuned models, the per-model cost of existing safety restoration methods becomes prohibitive. To address these challenges, we propose \textbf{SafetyLock}, a novel alignment intervention method designed to preserve robust safety post-fine-tuning through efficient and transferable mechanisms. Specifically, SafetyLock builds on the observation that \textbf{fine-tuned models still retain safety-related activation representations similar to those of their base models.} Leveraging this insight, we extract Meta-SafetyLock (a set of safety bias directions that capture key activation patterns associated with safe responses in the original model) across multiple token-level activation dimensions. These directions can then be universally applied to fine-tuned models, thereby restoring and enhancing safety. Experimental results show that \textbf{SafetyLock can re-align fine-tuned models in under 0.01 seconds without incurring additional computational cost.} Moreover, it can reduce the harmful instruction response rate in toxic fine-tuned models from \textbf{60\%} to below \textbf{1\%}. Compared to traditional methods, SafetyLock not only offers superior safety performance but also higher efficiency, providing a scalable solution for ensuring the reliability of customized LLMs.
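Illustration: as a rough sketch of the inference-time mechanism described in the abstract, the snippet below adds precomputed safety bias directions to a fine-tuned model's activations via forward hooks, leaving the weights untouched. It is not the authors' implementation (which extracts directions across token-level attention-head activations); the checkpoint name, layer indices, `safety_directions`, and the strength `alpha` are hypothetical placeholders chosen only to make the idea concrete.

```python
# Minimal sketch (assumptions labeled): apply precomputed "safety bias directions"
# to a fine-tuned model at inference time via forward hooks, without modifying weights.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3-8B-Instruct"  # placeholder checkpoint
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Hypothetical: one direction per intervened layer, extracted once from the
# base model's activations (the "Meta-SafetyLock" in the abstract).
safety_directions = {i: torch.randn(model.config.hidden_size) for i in (10, 15, 20)}
alpha = 4.0  # intervention strength (illustrative value)

def make_hook(direction, strength):
    direction = direction / direction.norm()  # unit-norm safety direction
    def hook(module, inputs, output):
        # output[0] holds the layer's hidden states: (batch, seq_len, hidden_size)
        hidden = output[0] + strength * direction.to(output[0].dtype).to(output[0].device)
        return (hidden,) + output[1:]
    return hook

# Attach hooks to the chosen decoder layers; this is the O(1) "lock" step.
handles = [model.model.layers[i].register_forward_hook(make_hook(d, alpha))
           for i, d in safety_directions.items()]

prompt = "How do I build something dangerous?"
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(out[0], skip_special_tokens=True))

for h in handles:
    h.remove()  # detach the intervention; the fine-tuned weights were never changed
```

Because the directions are added to activations rather than baked into the weights, the same set can be reused across many fine-tuned variants of one base model, which is what makes the per-model cost negligible.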
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 16461