Predicting Success of Model Editing via Intrinsic Features

Published: 24 Sept 2025, Last Modified: 24 Sept 2025, Venue: INTERPLAY, License: CC BY 4.0
Keywords: model editing, knowledge localization, Logit Lens, ROME
TL;DR: We evaluate whether the success of model editing can be predicted from intrinsic features, such as the layer in which the edit is performed, the location of the knowledge in the model, and the strength of the edit.
Abstract: Because information in the world changes constantly, the ability to update the factual knowledge of LLMs is important both for maintaining their veracity and for reducing the cost of retraining. Model editing has emerged as a research area that aims to perform surgical updates to model parameters in order to correct factually incorrect or outdated information. However, the factors that determine whether an edit applied to an LLM succeeds are unknown. In this work, we propose two metrics and show empirically that they can serve as indicators of editing outcomes: (1) the location where the knowledge is stored in the parameters, as reflected by the logit-lens technique; and (2) the probability that the model assigns to the original output. We find a correlation between the location of the knowledge and the optimal layer for editing, as well as between the output probability and edit success, as measured by efficacy and specificity. We also demonstrate the potential use of the output probability for setting the regularization of the editing process.
Public: Yes
Track: Main-Long
Submission Number: 7
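
The two indicators named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `logit_lens_probs` and `knowledge_layer`, the toy hidden states, and the identity unembedding matrix are all hypothetical, and reading "knowledge location" as the first layer whose logit-lens top-1 prediction equals the original token is an assumption made here for illustration only.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over the last axis.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def logit_lens_probs(hidden_states, unembed, token_id):
    """Probability of `token_id` when each layer's hidden state is
    projected directly through the unembedding matrix (logit lens).

    hidden_states: (num_layers, d_model), unembed: (d_model, vocab).
    """
    logits = hidden_states @ unembed
    return softmax(logits)[:, token_id]

def knowledge_layer(hidden_states, unembed, token_id):
    """First layer whose logit-lens top-1 token is `token_id`, else None.

    Assumption: this is one plausible reading of "knowledge location",
    not necessarily the definition used in the paper.
    """
    top1 = (hidden_states @ unembed).argmax(axis=-1)
    hits = np.flatnonzero(top1 == token_id)
    return int(hits[0]) if hits.size else None

# Hand-made toy example: 4 layers, 3-dimensional states, 3-token vocabulary.
hidden_states = np.array([
    [2.0, 0.0, 0.0],
    [1.0, 1.0, 0.0],
    [0.0, 1.0, 2.0],
    [0.0, 0.0, 4.0],
])
unembed = np.eye(3)   # hypothetical unembedding matrix
token_id = 2          # id of the original output token

probs = logit_lens_probs(hidden_states, unembed, token_id)
layer = knowledge_layer(hidden_states, unembed, token_id)
output_prob = probs[-1]  # probability the final layer assigns to the original output

print("per-layer probability:", probs)
print("knowledge layer:", layer)
print("output probability:", output_prob)
```

With a real model, the hidden states would come from the residual stream at each layer and the unembedding matrix from the model's output head; the toy arrays above only stand in for those.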