Keywords: Model Editing, Safety Alignment, Large Language Models
Abstract: Editing large language models is challenging, as incorporating new knowledge often requires sequential parameter updates while preserving model capability. In this work, we experimentally observe that sequential knowledge updating under the locate-then-edit framework can introduce safety risks, regardless of whether the knowledge being edited is benign or malicious. We propose a safety-aware model editing approach that estimates safety transforms, identifies the corresponding safety directions in the neural activation space, and then aligns neural activation updates and network parameter updates under these safety constraints. We evaluate our approach on the open-source LLMs Llama-3-8B-Instruct and Qwen3-4B-Instruct, using the benchmark datasets ZsRE and COUNTERFACT as well as the malicious dataset Mal-KSet. Experimental results demonstrate that our approach effectively reduces unsafe responses to malicious queries while preserving the effectiveness of model editing.
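For intuition only, below is a minimal PyTorch sketch of one way such a safety constraint could be realized: a difference-of-means safety direction estimated from activations on safe versus unsafe prompts, and a projection that removes the component of a locate-then-edit weight update along that direction. The function names (`safety_direction`, `constrain_update`), the difference-of-means estimator, and the orthogonal projection are illustrative assumptions, not the paper's actual algorithm.

```python
import torch


@torch.no_grad()
def safety_direction(safe_acts: torch.Tensor, unsafe_acts: torch.Tensor) -> torch.Tensor:
    """Estimate a unit safety direction in activation space.

    safe_acts / unsafe_acts: (n, d) hidden activations collected at the
    edited layer on safe vs. unsafe prompts. A difference-of-means vector
    is a common, simple proxy for a behavior-linked direction.
    """
    v = safe_acts.mean(dim=0) - unsafe_acts.mean(dim=0)
    return v / v.norm()


@torch.no_grad()
def constrain_update(delta_w: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    """Project a weight update orthogonally to the safety direction.

    delta_w: (d_out, d_in) parameter update produced by a locate-then-edit
    method; v: (d_out,) unit safety direction in the layer's output space.
    Applying (I - v v^T) on the left removes the update's effect along v:
        delta_w <- delta_w - v (v^T delta_w)
    """
    return delta_w - torch.outer(v, v @ delta_w)
```

Under these assumptions, the constrained update still injects the edited fact through the remaining subspace while no longer shifting the layer's output along the estimated safety direction; the paper's actual method additionally aligns the activation-level and parameter-level updates rather than projecting the weights alone.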
Paper Type: Long
Research Area: Safety and Alignment in LLMs
Research Area Keywords: model bias/unfairness mitigation, model editing, safety and alignment, adversarial attacks/examples
Contribution Types: Model analysis & interpretability, Approaches to low-resource settings
Languages Studied: English
Submission Number: 4911