Keywords: Knowledge Editing, Generalization, Large Language Models, Mechanistic Interpretability
Abstract: Large language models (LLMs) store factual knowledge at scale, and recent editing methods enable localized updates to target facts. While effective within a single prompt style, prior evaluations have largely overlooked whether edits generalize across task formats. In this work, we present a systematic study of cross-format generalization in parametric knowledge editing. Across six curated task formats, we show that edits that succeed in the source format often fail to transfer, and that high frequency of a fact in training does not guarantee a shared, format-invariant knowledge structure. Representation-level analyses reveal that edit directions cluster by task format, supporting a *distributed knowledge subspace hypothesis* in which knowledge is spread across multiple format-specific subspaces. To mitigate these failures, we introduce multi-format supervision with iterative expansion, which improves transferability with minimal overhead. Our study reframes generalization failure not only as a practical limitation of current editing methods but also as evidence about the distributed memory mechanisms underlying factual knowledge in language models.
Primary Area: interpretability and explainable AI
Submission Number: 8903