Keywords: Understanding high-level properties of models, Applications of interpretability, Foundational work, Steering
Other Keywords: memorization
TL;DR: We characterize how memorization is represented in Transformers and present an interpretability-based, unsupervised method for removing it that outperforms a supervised baseline.
Abstract: We characterize how memorization is represented in Transformer networks. We find that supervised memorization-removal models trained on a targeted set of memorized sequences also suppress untargeted memorization, implying a shared representational structure for memorized data. Building on links between memorization and loss curvature, we show this structure is disentangled in weight space when expressed in the eigenbasis of the (K-FAC) Fisher information. Using this decomposition, we propose an unsupervised parameter-ablation method that outperforms a supervised method at suppressing memorization, yields more natural generations in LMs, and improves generalization in label-noisy ViTs. Our work expands the understanding of verbatim memorization in neural networks and points to practical methods for suppressing it in trained models.
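To make the decomposition in the abstract concrete, here is a minimal, illustrative sketch (not the authors' code) of projecting a single linear layer's weights into a K-FAC Fisher eigenbasis and ablating components by curvature. It assumes the layer's Fisher is approximated as the Kronecker product of an input-activation covariance `A` and a pre-activation gradient covariance `S`; the function name, `keep_fraction` parameter, and the threshold-based selection rule are placeholders, since the abstract does not specify which eigen-components the paper's method removes.

```python
import numpy as np

def kfac_eigenbasis_ablation(W, A, S, keep_fraction=0.9):
    """Project W (out_dim x in_dim) into the K-FAC Fisher eigenbasis,
    zero the components with the smallest Fisher eigenvalues, and
    project back. Illustrative only; the actual selection criterion
    for memorization-linked components is the paper's contribution."""
    # Eigendecompose the two Kronecker factors (symmetric PSD matrices).
    lam_A, U_A = np.linalg.eigh(A)  # input-side factor, in_dim x in_dim
    lam_S, U_S = np.linalg.eigh(S)  # output-side factor, out_dim x out_dim

    # Express W in the eigenbasis: entry (i, j) of W_tilde pairs the
    # i-th output-side and j-th input-side eigenvectors.
    W_tilde = U_S.T @ W @ U_A

    # Under the Kronecker approximation, the Fisher eigenvalue for
    # component (i, j) is lam_S[i] * lam_A[j].
    curvature = np.outer(lam_S, lam_A)

    # Placeholder rule: keep the highest-curvature fraction of
    # components and zero the rest.
    thresh = np.quantile(curvature, 1.0 - keep_fraction)
    W_tilde[curvature < thresh] = 0.0

    # Map the edited weights back to the original basis.
    return U_S @ W_tilde @ U_A.T
```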
Submission Number: 273