Keywords: memorization, training dynamics, interpretability, generalization, language models
TL;DR: Memorization in transformer LMs is tied to pattern acquisition: it is non-trivial and occurs in bursts corresponding to shared patterns. Intriguingly, the relative memorization speed of larger and smaller models can change depending on the pattern type.
Abstract: Memorization in language models is a critical yet poorly understood phenomenon. In this work, we investigate memorization in transformer-based language models by analyzing their memorization dynamics during training over multiple epochs. We find that memorization is neither a constant accumulation of sequences nor simply dictated by the recency of exposure to these sequences. Instead, much like generalization, memorization appears to be driven by pattern recognition. Tracking memorization dynamics in mixed datasets, we observe that models memorize different sub-datasets in distinct bursts, suggesting that each subset is associated with unique underlying patterns, and that the model prefers to learn these patterns in a consistent order. We also find that easily learnable patterns tend to support generalization on unseen data, while more complex patterns do not. Furthermore, in datasets with weak or absent patterns, larger models may delay memorization relative to smaller ones, a behavior we term $\textit{overthinking}$. Our results show that the subset of sequences memorized by a model over time is not arbitrary, and give insights into the internal processes a model goes through during training. Our code is available at: https://github.com/mdrpanwar/memorization-patterns.
Primary Area: Deep learning (e.g., architectures, generative models, optimization for deep networks, foundation models, LLMs)
Submission Number: 23578