Keywords: Memorization, Unlearning, Localization
TL;DR: We propose a method to disentangle sequence memorization from general language model capabilities during pretraining.
Abstract: Verbatim memorization in large language models remains a persistent and unsolved challenge, raising critical concerns for privacy, copyright, and responsible deployment. Existing research suggests that effective unlearning requires targeting the specific neurons responsible for memorization, as broad model updates fail to erase content reliably. However, we show that even these approaches rest on a flawed premise. Through controlled experiments, we demonstrate that memorized sequences are not naturally isolated in specific neurons during training, except when the sequences are highly atypical. In this work, we put forward a new training paradigm that attempts to \textbf{isolate memorization to specific neurons by design}. The core challenge is that gradients from repeated sequences entangle ``generalizing'' features, which improve general capability, with sequence-specific memorization. We show that a simple change to standard training can implicitly disentangle the two by leveraging metadata that identifies repeated sequences. We verify the efficacy of our method (\seqtd) in a proof-of-concept natural language setting and, by analyzing the training dynamics of memorization, unveil the mechanism that makes this disentanglement possible. We conclude by discussing practical considerations for deploying \seqtd and highlight potential avenues for incorporating it into large-scale settings.
Submission Number: 48
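To make the high-level idea in the abstract concrete, the following is a minimal, purely illustrative sketch; it is not the paper's \seqtd implementation. It only assumes, as the abstract states, that training can leverage metadata flagging repeated sequences. The module name, layer sizes, and the specific gating scheme (routing flagged sequences through a designated set of ``memorization'' neurons that can be ablated at deployment) are assumptions made for illustration.

```python
# Illustrative sketch only, NOT the paper's \seqtd method.
# Assumption: per-sequence metadata flags repeated sequences, and a small
# designated "memorization" path is gated on only for those sequences, so that
# sequence-specific gradients are routed into it and can be removed later.
import torch
import torch.nn as nn


class GatedFFN(nn.Module):
    def __init__(self, d_model: int, d_general: int, d_mem: int):
        super().__init__()
        # General-capability path, updated by all sequences.
        self.general = nn.Sequential(
            nn.Linear(d_model, d_general), nn.GELU(), nn.Linear(d_general, d_model)
        )
        # Designated memorization path, gated by the repetition flag.
        self.mem = nn.Sequential(
            nn.Linear(d_model, d_mem), nn.GELU(), nn.Linear(d_mem, d_model)
        )

    def forward(self, x: torch.Tensor, is_repeated: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model); is_repeated: (batch,) float flags from metadata.
        gate = is_repeated.view(-1, 1, 1)  # 1.0 for flagged (repeated) sequences
        return x + self.general(x) + gate * self.mem(x)


# Toy usage: flagged sequences send gradients through both paths; unflagged ones
# only update the general path. Zeroing the `mem` parameters at deployment would
# drop the memorization-designated capacity without touching general parameters.
block = GatedFFN(d_model=16, d_general=64, d_mem=8)
x = torch.randn(4, 10, 16)
flags = torch.tensor([1.0, 0.0, 0.0, 1.0])  # metadata: which sequences are repeats
out = block(x, flags)
out.sum().backward()
```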