Model Editing is Over: Revealing Its Illusory Success and Fragile Foundation

08 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: Large language models, model editing, adversarial attack
Abstract: Knowledge editing refers to updating, deleting, or forgetting outdated or incorrect knowledge in large language models (LLMs). Compared to traditional approaches such as fine-tuning, retrieval-augmented generation, or adding extra memory modules, locate-then-edit (LTE) has recently emerged as a promising paradigm owing to its effectiveness and efficiency: it precisely edits a small subset of parameters so that a specific fact is updated while other knowledge is preserved. Despite the great success reported in prior work, we find that the apparent reliability of LTE rests on a fragile foundation and that the current literature is largely driven by illusory success. Rather than leveraging real semantics, the fundamental goal of steering the model’s output toward a target with minimal modification can encourage the exploitation of hidden shortcuts, much like an adversarial attack. This problem challenges the feasibility of the current LTE literature at its very foundation, as shortcuts are inherently at odds with robust knowledge integration. Coincidentally, this issue has long been obscured by evaluation frameworks that lack negative examples. To uncover it, we systematically develop a suite of new evaluation methods. Strikingly, we find that state-of-the-art approaches collapse even under the simplest negation queries. Our empirical studies reveal that LTE likely relies on shortcuts rather than full semantics, calling for an urgent reconsideration of the very basis of LTE before further advances can be meaningfully pursued.
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 2922