Keywords: Model Editing, Safety Alignment, Large Language Models
Abstract: Editing large language models is challenging, as incorporating new knowledge often requires sequential parameter updates while preserving model capability. In this work, we experimentally observe that sequential knowledge updating under the locate-then-edit framework can introduce safety risks, regardless of whether the knowledge being edited is benign or malicious. We propose a safety-aware model editing approach that estimates safety transforms, identifies the corresponding safety directions in the neural activation space, and then aligns neural activation updates and network parameter updates under these safety constraints. We evaluate our approach on the open-source LLMs Llama-3-8B-Instruct and Qwen3-4B-Instruct, using the benchmark datasets ZsRE and COUNTERFACT as well as the malicious dataset Mal-KSet. Experimental results demonstrate that our approach effectively reduces unsafe responses to malicious queries while preserving the effectiveness of model editing.
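For intuition only, below is a minimal PyTorch sketch of one way such a safety constraint could be realized: a difference-of-means safety direction estimated from activations on safe versus unsafe prompts, and a projection that removes the component of a locate-then-edit weight update along that direction. The function names (`safety_direction`, `constrain_update`), the difference-of-means estimator, and the orthogonal projection are illustrative assumptions, not the paper's actual algorithm.

```python
import torch


@torch.no_grad()
def safety_direction(safe_acts: torch.Tensor, unsafe_acts: torch.Tensor) -> torch.Tensor:
    """Estimate a unit safety direction in activation space.

    safe_acts / unsafe_acts: (n, d) hidden activations collected at the
    edited layer on safe vs. unsafe prompts. A difference-of-means vector
    is a common, simple proxy for a behavior-linked direction.
    """
    v = safe_acts.mean(dim=0) - unsafe_acts.mean(dim=0)
    return v / v.norm()


@torch.no_grad()
def constrain_update(delta_w: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    """Project a weight update orthogonally to the safety direction.

    delta_w: (d_out, d_in) parameter update produced by a locate-then-edit
    method; v: (d_out,) unit safety direction in the layer's output space.
    Applying (I - v v^T) on the left removes the update's effect along v:
        delta_w <- delta_w - v (v^T delta_w)
    """
    return delta_w - torch.outer(v, v @ delta_w)
```

Under these assumptions, the constrained update still injects the edited fact through the remaining subspace while no longer shifting the layer's output along the estimated safety direction; the paper's actual method additionally aligns the activation-level and parameter-level updates rather than projecting the weights alone.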
Paper Type: Long
Research Area: Safety and Alignment in LLMs
Research Area Keywords: model bias/unfairness mitigation, model editing, safety and alignment, adversarial attacks/examples
Contribution Types: Model analysis & interpretability, Approaches to low-resource settings
Languages Studied: English
Submission Number: 4911