Keywords: associative memory, discrete diffusion models, language modeling, memorization, generalization
TL;DR: As the training dataset size increases, a significant entropy gap emerges: the conditional entropy of most tokens no longer vanishes. This entropy gap marks the generalization regime of Discrete Diffusion Models.
Abstract: Associative Memory (AM) systems reliably retrieve data points by establishing distinct basins of attraction around them. While historically reliant on explicit, well-defined energy functions, as in Hopfield networks, stable attractors can also be formed via conditional likelihood maximization without the need for such functions. Exploiting this property, we demonstrate that **Uniform-based Discrete Diffusion Models** (UDDMs) behave as AMs through their use of conditional likelihood dynamics for sampling and training. By evaluating token recovery, we identify a memorization-to-generalization phase transition governed by training dataset size. With little training data, UDDMs exhibit near-perfect memorization, characterized by vanishing conditional entropy. As the training set grows, however, unseen test examples become stable attractors of the system and can be effectively denoised. This behavior highlights an emergent capability, marking the shift to generalization.
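A minimal sketch of the entropy diagnostic described above, assuming a trained UDDM whose denoiser returns per-token categorical probabilities over the vocabulary; `denoiser_probs`, its signature, and the toy setup below are illustrative assumptions, not the authors' API. Near-zero mean entropy would indicate memorization (tokens recovered deterministically), while a persistent entropy gap would indicate the generalization regime:

```python
import numpy as np

def token_entropies(probs: np.ndarray) -> np.ndarray:
    """Shannon entropy (nats) of each token's categorical distribution.

    probs: array of shape (seq_len, vocab_size) with rows summing to 1.
    """
    eps = 1e-12  # guard against log(0)
    return -(probs * np.log(probs + eps)).sum(axis=-1)

def mean_conditional_entropy(denoiser_probs, corrupted_seqs) -> float:
    """Average per-token conditional entropy over a batch of corrupted sequences.

    denoiser_probs: callable mapping a corrupted sequence to per-token
    categorical probabilities (a stand-in for the model's posterior over
    clean tokens).
    """
    ents = [token_entropies(denoiser_probs(x)).mean() for x in corrupted_seqs]
    return float(np.mean(ents))

# Toy stand-in for a trained denoiser: uniform predictions, i.e. maximal
# entropy log(V). A memorizing model would instead return near one-hot rows,
# driving the mean entropy toward zero.
V, L = 8, 16
dummy_denoiser = lambda x: np.full((L, V), 1.0 / V)
batch = [np.random.randint(0, V, size=L) for _ in range(4)]
print(mean_conditional_entropy(dummy_denoiser, batch))  # ~log(8) ≈ 2.079
```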
Submission Number: 37