Keywords: Explainable AI, Mechanistic Interpretability, Large Language Models
Abstract: Large language models (LLMs) store vast amounts of knowledge, which often requires updates to correct factual errors, incorporate newly acquired information, or adapt model behavior. Model editing methods have emerged as efficient solutions for such updates, offering localized and precise knowledge modification at significantly lower computational cost than continual training. In parallel, LLMs are frequently fine-tuned for a wide range of downstream tasks. However, the effect of fine-tuning on previously edited knowledge remains poorly understood. In this work, we systematically investigate how different fine-tuning objectives interact with various model editing techniques. \textbf{Our findings show that edited knowledge is more easily forgotten during fine-tuning than intrinsic knowledge acquired through pre-training, revealing a fundamental distinction between post-hoc edits and native model knowledge.} This analysis highlights a key limitation of current editing approaches and suggests that evaluating edit robustness under downstream fine-tuning is critical for their practical deployment. We further find that knowledge retention can be significantly improved either by augmenting the edited knowledge with paraphrases or by freezing the layers associated with edited content during fine-tuning, offering insights for developing more robust editing algorithms.
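For illustration, below is a minimal sketch of the layer-freezing mitigation mentioned in the abstract, assuming a Hugging Face Transformers GPT-2 model. The indices in edited_layer_ids are hypothetical placeholders; in practice they would be the layers modified by the editing method (e.g., the MLP layers targeted by a locate-then-edit editor).

from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2")

# Hypothetical: transformer blocks touched by a prior knowledge edit.
edited_layer_ids = {5, 6}

for name, param in model.named_parameters():
    # Freeze every parameter belonging to an edited block so that
    # downstream fine-tuning cannot overwrite the injected knowledge.
    if any(f"transformer.h.{i}." in name for i in edited_layer_ids):
        param.requires_grad = False

# Fine-tune as usual; only the non-frozen parameters receive gradient updates.
trainable_params = [p for p in model.parameters() if p.requires_grad]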
Paper Type: Long
Research Area: AI/LLM Agents
Research Area Keywords: model editing, robustness
Contribution Types: Model analysis & interpretability
Languages Studied: English
Submission Number: 6369