Resolving Lexical Bias in Model Editing

Published: 01 May 2025, Last Modified: 18 Jun 2025, ICML 2025 poster, CC BY 4.0
TL;DR: PENME addresses a critical vulnerability of model editing techniques by learning a representation space that prevents erroneous edit retrieval caused by spurious representation similarities.
Abstract: Model editing aims to modify the outputs of large language models after they are trained. Previous approaches have often involved direct alterations to model weights, which can result in model degradation. Recent techniques avoid making modifications to the model's weights by using an adapter that applies edits to the model when triggered by semantic similarity in the representation space. We demonstrate that current adapter methods are *critically vulnerable* to strong lexical biases, leading to issues such as applying edits to irrelevant prompts with overlapping words. This paper presents a principled approach to learning a disentangled representation space that facilitates precise localization of edits by maintaining distance between irrelevant prompts while preserving proximity among paraphrases. In our empirical study, we show that our method (Projector Editor Networks for Model Editing - PENME) achieves state-of-the-art model editing results while being more computationally efficient during inference than previous methods and adaptable across different architectures.
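To make the adapter-based setup concrete, the sketch below shows the general pattern of retrieval-based editing in a learned projection space: a small projector maps the frozen model's hidden state for a prompt into an edit space, and a stored edit is applied only when the projected prompt falls within a distance threshold of an edit key. This is a minimal illustration under assumed names (`Projector`, `edit_keys`, `retrieve_edit`, `threshold`), not the authors' implementation; see the linked repository for PENME itself.

```python
# Minimal sketch (not the authors' code) of adapter-style edit retrieval
# in a learned projection space. All names here are illustrative.
import torch
import torch.nn as nn


class Projector(nn.Module):
    """Small MLP that maps a frozen LM hidden state into the edit space."""

    def __init__(self, hidden_dim: int, proj_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(hidden_dim, proj_dim),
            nn.ReLU(),
            nn.Linear(proj_dim, proj_dim),
        )

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        return self.net(h)


def retrieve_edit(prompt_hidden, projector, edit_keys, threshold):
    """Return the index of the matching stored edit, or None if unrelated.

    prompt_hidden: (hidden_dim,) hidden state of the prompt from the frozen LM
    edit_keys:     (num_edits, proj_dim) projected representations of stored edits
    """
    z = projector(prompt_hidden.unsqueeze(0))      # (1, proj_dim)
    dists = torch.cdist(z, edit_keys).squeeze(0)   # (num_edits,)
    nearest = torch.argmin(dists)
    # Apply the stored edit only if the prompt is close enough in the
    # projected space; otherwise fall back to the unedited model.
    return nearest.item() if dists[nearest] < threshold else None
```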
Lay Summary: Language models constantly need to learn new information, but directly rewriting or retraining them can accidentally make them "forget" things they already knew. Imagine trying to update a textbook by erasing and rewriting pages, but accidentally erasing other important information without realising it. To avoid this, a common approach is to store each correction separately and, by looking at how the model processes a new prompt, apply a stored correction when the prompt seems **similar to a past correction**. Our paper shows that this "similarity test" often falls short. It tends to be tricked by lexical bias, meaning it focuses too much on shared words rather than the actual meaning. For example, if you corrected the model on "The twin city of Portsmouth," it might mistakenly apply that correction to "The twin city of Pittsburgh" just because both sentences contain "The twin city of," even though they refer to different places. This leads to misfires, where the model applies the wrong edit. To solve this issue, we introduce **Projector Editor Networks for Model Editing (PENME)**. PENME uses a small "projector" that transforms a prompt into a space where **meaning outranks wording**. This way, the model becomes much better at recognising when a new prompt truly relates to a past correction, rather than being tricked by similar words. Our system helps language models learn and remember new information better, without forgetting or confusing what they already know. PENME is also easy to use with different language models, offering a fast and reliable way to keep them up to date.
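Consistent with the paper's keywords, a projector of this kind could in principle be trained with a contrastive objective: paraphrases of an edit prompt serve as positives and lexically overlapping but unrelated prompts as negatives. The snippet below is an assumption-laden toy example of one such triplet-loss training step (reusing the illustrative `Projector` from the sketch above), not PENME's training code.

```python
# Illustrative contrastive (triplet) training step for a projector:
# paraphrases are pulled together, lexical lookalikes are pushed apart.
# Hidden states are assumed to come from a frozen language model.
import torch
import torch.nn.functional as F


def triplet_step(projector, optimizer, h_anchor, h_paraphrase, h_lexical_neighbor,
                 margin: float = 1.0) -> float:
    """One optimization step on a (anchor, paraphrase, lexical-neighbor) triplet."""
    za = projector(h_anchor)            # projected edit prompt
    zp = projector(h_paraphrase)        # positive: a paraphrase of the edit prompt
    zn = projector(h_lexical_neighbor)  # negative: shares words but is unrelated
    loss = F.triplet_margin_loss(za, zp, zn, margin=margin)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```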
Link To Code: https://github.com/hammadrizwan/PENME.git
Primary Area: Deep Learning->Large Language Models
Keywords: Model Editing, Contrastive Learning, Representation Learning
Submission Number: 13175