Tracing and Reversing Edits in LLMs: A Study on Rank-One Model Edits

ICLR 2026 Conference Submission 18024 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Model Editing, Knowledge Editing, Countermeasures to Malicious Knowledge Editing
TL;DR: We propose two novel methods to trace and reverse rank-one model edits based solely on the edited parameters.
Abstract: Knowledge editing methods (KEs) are a cost-effective way to update the factual content of large language models (LLMs), but they pose a dual-use risk. While KEs are beneficial for updating outdated or incorrect information, they can be exploited maliciously to implant misinformation or bias. Defending against such malicious manipulation requires robust techniques that can reliably detect, interpret, and mitigate adversarial edits. To that end, we introduce the tasks of tracing and reversing edits. We propose a novel method that infers the edited object entity with up to 99\% accuracy, based solely on the modified weights, without access to the editing prompt or any semantically similar prompts. Further, we propose an effective, training-free method for reversing edits. Our method recovers up to 93\% of edits and restores the original model's output distribution without access to any information about the edit. It can further be used to distinguish edited from unedited weights. Our findings highlight the feasibility of tracing and reversing edits based on the edited weights alone, opening a new research direction for safeguarding LLMs against adversarial manipulation.
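To make the rank-one setting concrete, here is a minimal NumPy sketch (not the authors' method, whose details are not given in this abstract): a rank-one edit W' = W + u vᵀ, as in ROME-style editors, leaves a rank-one residual in the weight matrix. Assuming access to the unedited base checkpoint for comparison (an assumption; the paper's methods reportedly need only the edited parameters), SVD recovers the edit direction, which could then be decoded into candidate object tokens, and subtracting the recovered component undoes the edit.

```python
# Hedged sketch (not the authors' algorithm): a ROME-style rank-one edit
# modifies a single projection matrix as W' = W + u v^T, where v is a "key"
# direction matching the edited subject and u a "value" direction writing
# the new object. Assuming the unedited base weights are available (e.g. a
# public checkpoint), the residual W' - W is exactly rank one, so SVD
# recovers the edit direction and subtracting it reverses the edit.
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in = 64, 32

W = rng.normal(size=(d_out, d_in))   # unedited base weights
u = rng.normal(size=(d_out, 1))      # value direction (new object)
v = rng.normal(size=(d_in, 1))       # key direction (edited subject)
W_edited = W + u @ v.T               # rank-one model edit

# Tracing: the residual has (numerical) rank one ...
delta = W_edited - W
U, S, Vt = np.linalg.svd(delta)
print("effective rank of the edit:", int(np.sum(S > 1e-8 * S[0])))  # -> 1

# ... and its leading left singular vector recovers u up to sign and scale.
# Decoding this direction through the unembedding matrix (logit-lens style)
# would suggest candidate object tokens -- the premise behind tracing.
u_hat = U[:, 0]
cos = float(np.dot(u_hat, u.ravel()) / (np.linalg.norm(u_hat) * np.linalg.norm(u)))
print(f"cosine(u_hat, u) = {cos:+.3f}")  # close to +1 or -1

# Reversing: subtract the recovered rank-one component.
W_restored = W_edited - S[0] * np.outer(U[:, 0], Vt[0])
print("restoration error:", np.linalg.norm(W_restored - W))  # ~0
```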
Primary Area: interpretability and explainable AI
Submission Number: 18024