SafeAdapt: Safeguarding Large Language Models During Model Adaptation

ACL ARR 2025 February Submission5007 Authors

16 Feb 2025 (modified: 09 May 2025) · ACL ARR 2025 February Submission · CC BY 4.0
Abstract: Large Language Models (LLMs) are essential to many AI applications, but adaptations such as model pruning and task-specific fine-tuning can unintentionally introduce safety risks by altering weight configurations. Previous efforts to improve safety have focused primarily on fine-tuning or RLHF to realign model behavior with ethical standards. However, these methods often demand significant resources, making them difficult to apply at scale. In this paper, we introduce SafeAdapt, an efficient approach for preserving safety alignment by identifying and safeguarding safety-critical weights within models. To this end, we propose a saliency criterion that evaluates how weight perturbations influence safety-aligned responses and quantifies how sensitive safety is to each weight. Building on this criterion, we develop weight preservation strategies that keep the most critical weights intact during fine-tuning and pruning, ensuring the model remains safe. The effectiveness of SafeAdapt is validated through extensive experiments on widely adopted models such as Llama, Qwen, and Gemma, demonstrating its ability to identify safety-related weights and to maintain the safety of fine-tuned or pruned models.
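The abstract describes a saliency criterion over weights and a preservation strategy, but not their exact form. The sketch below is a hypothetical first-order instantiation, assuming PyTorch and a safety-alignment loss: each weight is scored by |w · ∂L_safety/∂w| (how much a small perturbation would affect safety-aligned responses), and the top-scoring weights are masked so a downstream fine-tuning or pruning loop can leave them untouched. The function names, the keep_ratio parameter, and the scoring formula are illustrative assumptions, not the paper's method.

```python
import torch


def safety_saliency(model, safety_batches, loss_fn):
    """Accumulate |w * dL_safety/dw| per parameter as a first-order
    estimate of how sensitive safety-aligned behavior is to each weight.
    (Illustrative criterion; the paper's exact formulation may differ.)"""
    scores = {n: torch.zeros_like(p) for n, p in model.named_parameters()}
    for inputs, targets in safety_batches:
        model.zero_grad()
        loss = loss_fn(model(inputs), targets)
        loss.backward()
        for n, p in model.named_parameters():
            if p.grad is not None:
                scores[n] += (p.detach() * p.grad.detach()).abs()
    return scores


def protect_top_weights(scores, keep_ratio=0.01):
    """Return boolean masks marking the highest-saliency (most
    safety-critical) weights in each parameter tensor."""
    masks = {}
    for n, s in scores.items():
        k = max(1, int(keep_ratio * s.numel()))
        threshold = s.flatten().topk(k).values.min()
        masks[n] = s >= threshold
    return masks
```

In such a setup, the masks could be used to zero out gradients on protected positions during fine-tuning, or to exempt those positions from removal when pruning, which is one way to realize the weight preservation strategy the abstract alludes to.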
Paper Type: Long
Research Area: Interpretability and Analysis of Models for NLP
Research Area Keywords: calibration/uncertainty, robustness
Languages Studied: English
Submission Number: 5007