Keywords: Understanding high-level properties of models, Applications of interpretability, Foundational work, Steering
Other Keywords: memorization
TL;DR: We characterize how memorization is represented in Transformers and present an interpretability-based, unsupervised method for removing it that outperforms a supervised baseline.
Abstract: We characterize how memorization is represented in Transformer networks. We find that supervised memorization-removal models trained on a targeted set of memorized sequences also suppress untargeted memorization, implying a shared representational structure for memorized data. Building on links between memorization and loss curvature, we show this structure is disentangled in weight space when expressed in the eigenbasis of the (K-FAC) Fisher information. Using this decomposition, we propose an unsupervised parameter-ablation method that outperforms a supervised method at suppressing memorization, yields more natural generations in LMs, and improves generalization in label-noisy ViTs. Our work expands the understanding of verbatim memorization in neural networks and points to practical methods for suppressing it in trained models.
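To make the decomposition in the abstract concrete, here is a minimal, illustrative sketch (not the authors' code) of projecting a single linear layer's weights into a K-FAC Fisher eigenbasis and ablating components by curvature. It assumes the layer's Fisher is approximated as the Kronecker product of an input-activation covariance `A` and a pre-activation gradient covariance `S`; the function name, `keep_fraction` parameter, and the threshold-based selection rule are placeholders, since the abstract does not specify which eigen-components the paper's method removes.

```python
import numpy as np

def kfac_eigenbasis_ablation(W, A, S, keep_fraction=0.9):
    """Project W (out_dim x in_dim) into the K-FAC Fisher eigenbasis,
    zero the components with the smallest Fisher eigenvalues, and
    project back. Illustrative only; the actual selection criterion
    for memorization-linked components is the paper's contribution."""
    # Eigendecompose the two Kronecker factors (symmetric PSD matrices).
    lam_A, U_A = np.linalg.eigh(A)  # input-side factor, in_dim x in_dim
    lam_S, U_S = np.linalg.eigh(S)  # output-side factor, out_dim x out_dim

    # Express W in the eigenbasis: entry (i, j) of W_tilde pairs the
    # i-th output-side and j-th input-side eigenvectors.
    W_tilde = U_S.T @ W @ U_A

    # Under the Kronecker approximation, the Fisher eigenvalue for
    # component (i, j) is lam_S[i] * lam_A[j].
    curvature = np.outer(lam_S, lam_A)

    # Placeholder rule: keep the highest-curvature fraction of
    # components and zero the rest.
    thresh = np.quantile(curvature, 1.0 - keep_fraction)
    W_tilde[curvature < thresh] = 0.0

    # Map the edited weights back to the original basis.
    return U_S @ W_tilde @ U_A.T
```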
Submission Number: 273