Representation Shattering in Transformers: A Synthetic Study with Knowledge Editing

Published: 01 May 2025 · Last Modified: 18 Jun 2025 · ICML 2025 Poster · CC BY 4.0
TL;DR: Our work introduces a synthetic framework that enables a deeper understanding of Knowledge Editing and its side effects on model representations.
Abstract: Knowledge Editing (KE) algorithms alter models' weights to perform targeted updates to incorrect, outdated, or otherwise unwanted factual associations. However, recent work has shown that applying KE can adversely affect models' broader factual recall accuracy and diminish their reasoning abilities. Although these studies give insights into the potential harms of KE algorithms, e.g., through performance evaluations on benchmarks, little is understood about why such destructive failures occur. Motivated by this, we define a novel synthetic task in which a Transformer is trained from scratch to internalize a "structured" knowledge graph. The structure enforces relationships between entities of the graph, such that editing a factual association has "trickling effects" on other entities (e.g., altering the fact that X's parent is Y to Z affects who X's siblings' parent is). Through evaluations of edited models on this task, we show that KE inadvertently affects representations of entities beyond the targeted one, distorting relevant structures that allow a model to infer unseen knowledge about an entity. We call this phenomenon representation shattering and demonstrate that it degrades models' factual recall and reasoning performance. We further corroborate our findings in naturalistic settings with pre-trained Llama and Mamba models. Overall, our work yields a precise mechanistic hypothesis to explain why KE has adverse effects on model abilities.
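As a concrete illustration of the "trickling effects" described in the abstract, the minimal sketch below is illustrative only: the entity names and the implied_edits helper are hypothetical and not taken from the paper's codebase. It encodes the structural constraint that siblings share a parent, so a single edit to X's parent logically forces a matching edit for X's sibling.

```python
# Hypothetical toy illustration of "trickling effects" in a structured
# knowledge graph (a sketch under assumed names, not the paper's code).

# Base facts: parent relations (editable) and sibling relations (structure).
parent = {"X": "Y", "A": "Y", "B": "W"}      # entity -> its parent
sibling = {"X": ["A"], "A": ["X"], "B": []}  # entity -> list of siblings

def implied_edits(entity, new_parent):
    """Edits implied by the constraint that siblings share a parent."""
    implied = {}
    for s in sibling[entity]:
        if parent[s] != new_parent:
            implied[s] = new_parent
    return implied

# Target edit: "X's parent is Y" -> "X's parent is Z".
target, new_value = "X", "Z"
ripple = implied_edits(target, new_value)
print(f"Editing parent[{target!r}] -> {new_value!r} also implies:", ripple)
# -> {'A': 'Z'}: since A is X's sibling, the answer to "who is the parent
#    of X's sibling?" must change from Y to Z as well.

parent[target] = new_value
parent.update(ripple)  # propagate the implied edits
print("Consistent graph after edit:", parent)
```

A KE method that rewrites only the single targeted triple, without respecting such structural constraints, leaves the model's stored knowledge internally inconsistent; this is the regime in which the paper studies representation shattering.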
Lay Summary: When a large language model is patched after training to alter some memorized factual knowledge, the change often degrades its broader knowledge and reasoning capabilities. The cause of this phenomenon has so far been unclear. We study it by building a small synthetic world of linked facts, training a Transformer on it, and then editing facts while tracking how the network's internal representations change. We show that knowledge editing systematically fractures the neat geometric structure that stores information, a phenomenon we call "representation shattering." We find that the degree of shattering predicts how much the model's overall accuracy drops, and we verify the same effect in real models such as Llama-3 and Mamba. By revealing this hidden failure mode, our work offers a practical warning signal for risky edits and a direction for gentler, more reliable knowledge-updating techniques. Understanding and mitigating representation shattering will help future language models stay accurate, consistent, and trustworthy as they are regularly updated.
Link To Code: https://github.com/KentoNishi/KE-ICML-2025
Primary Area: Social Aspects->Accountability, Transparency, and Interpretability
Keywords: mechanistic interpretability, knowledge editing, transformers
Submission Number: 1639