Keywords: Knowledge Editing, Detecting Edits, Reversing Edits, Unlearning Edits
TL;DR: We study identifying and reversing (unlearning) rank-one model edits
Abstract: Knowledge editing methods (KEs) are a cost-effective way to update the factual content of large language models (LLMs), but they pose a dual-use risk. While KEs are beneficial for updating outdated or incorrect information, they can be exploited maliciously to implant misinformation or bias. To defend against such malicious manipulation, we need robust techniques that can reliably detect, interpret, and mitigate adversarial edits. This work investigates the traceability and reversibility of knowledge edits, focusing on the widely used Rank-One Model Editing (ROME) method. We propose a method to infer the edited object entity directly from the modified weights, without access to the editing prompt, achieving over 95\% accuracy. Furthermore, we show that edits can be reversed, recovering the model’s original outputs with $\geq 80\%$ accuracy. Our findings highlight the feasibility of tracing and reversing edits based on the edited weights, offering a robust framework for safeguarding LLMs against adversarial manipulations.
Submission Number: 30
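The abstract does not describe the detection procedure itself, but the property that makes tracing plausible is that a ROME edit adds a rank-one update $\Delta = u v^{\top}$ to a single MLP projection matrix. Below is a minimal illustrative sketch (not the authors' method, and it additionally assumes access to the pre-edit weights) of how such an update could be isolated from a weight diff via SVD; the recovered factors are the kind of signal one could then decode into the edited object entity.

```python
import torch

def extract_rank_one_update(w_orig: torch.Tensor, w_edit: torch.Tensor, tol: float = 1e-6):
    """Recover the rank-one update Delta = u v^T that a ROME-style edit adds to
    one projection matrix, given the original and edited weights (an assumption
    made for this illustration; the paper works from the modified weights)."""
    delta = w_edit - w_orig
    # SVD of the difference: for a genuine rank-one edit, only the first
    # singular value should be numerically non-zero.
    U, S, Vh = torch.linalg.svd(delta, full_matrices=False)
    is_rank_one = bool((S[1:] < tol * S[0]).all()) if S.numel() > 1 else True
    u = U[:, 0] * S[0]   # left factor (output-side / value direction)
    v = Vh[0, :]         # right factor (input-side / key direction)
    return is_rank_one, u, v

# Toy example: apply a synthetic rank-one edit to a random 64x128 matrix.
torch.manual_seed(0)
w_orig = torch.randn(64, 128)
u_true, v_true = torch.randn(64), torch.randn(128)
w_edit = w_orig + torch.outer(u_true, v_true)

found, u, v = extract_rank_one_update(w_orig, w_edit)
print(found)  # True: the weight difference is rank one, so the edit direction is recoverable
```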