Associative Memory Under the Probabilistic Lens: Improved Transformers & Dynamic Memory Creation

Published: 27 Oct 2023, Last Modified: 08 Jan 2024 · AMHN23 Poster
Keywords: associative memory model, transformer, hopfield network, modern continuous hopfield network, bayesian nonparametrics, combinatorial stochastic processes
TL;DR: We equip modern continuous Hopfield networks with the ability to create new memories as necessitated by the data
Abstract: Clustering is a fundamental unsupervised learning problem, and recent work showed that modern continuous associative memory (AM) networks can learn to cluster data via a novel unconstrained continuous relaxation of the discrete clustering optimization problem. In this work, we demonstrate that the energy function of that AM network can be viewed as the scaled negative log-likelihood of a Gaussian mixture model, and that the dynamics of the AM network can be viewed as performing expectation maximization via gradient ascent rather than via closed-form coordinate ascent. Based on this insight, we show that a widespread practical implementation choice, self-attention with pre-layer normalization, approximates clustering on the hypersphere with inhomogeneous von Mises-Fisher likelihoods, suggesting a future experiment to improve transformers. We additionally leverage this connection to propose a novel AM network that can create new memories during learning, as necessitated by the data, by drawing on tools from combinatorial stochastic processes and Bayesian nonparametrics.
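The correspondence sketched in the abstract can be illustrated with a minimal NumPy example. This is not the paper's implementation, only a hedged sketch under simplifying assumptions (equal-norm memories, shared isotropic covariance with concentration `beta`): the modern continuous Hopfield energy is, up to constants, the scaled negative log-likelihood of a Gaussian mixture whose means are the stored memories, and one gradient-ascent step on that log-likelihood with step size 1 recovers the familiar attention-style update `x ← Mᵀ softmax(β M x)`.

```python
import numpy as np

def softmax(z):
    z = z - z.max()          # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def hopfield_energy(x, memories, beta=1.0):
    """Modern continuous Hopfield energy.

    Up to additive/multiplicative constants, this equals the scaled
    negative log-likelihood of x under a Gaussian mixture with means
    `memories` (assumed equal-norm) and shared isotropic covariance.
    """
    scores = beta * memories @ x
    lse = scores.max() + np.log(np.exp(scores - scores.max()).sum())
    return -lse / beta + 0.5 * x @ x

def gradient_ascent_step(x, memories, beta=1.0, lr=1.0):
    """One gradient step on the (scaled) log-likelihood.

    The softmax weights play the role of EM responsibilities (E-step);
    the gradient step nudges x toward the responsibility-weighted mean
    (a gradient-ascent M-step). With lr=1 this is exactly the
    attention-style fixed-point update x <- M^T softmax(beta M x).
    """
    p = softmax(beta * memories @ x)   # E-step: cluster responsibilities
    grad = memories.T @ p - x          # gradient of -energy w.r.t. x
    return x + lr * grad

rng = np.random.default_rng(0)
M = rng.normal(size=(4, 8))
M /= np.linalg.norm(M, axis=1, keepdims=True)   # equal-norm stored memories
x = M[0] + 0.1 * rng.normal(size=8)             # noisy probe near memory 0

for _ in range(10):
    e_before = hopfield_energy(x, M, beta=8.0)
    x = gradient_ascent_step(x, M, beta=8.0)
    e_after = hopfield_energy(x, M, beta=8.0)
    assert e_after <= e_before + 1e-9            # energy is non-increasing

print(np.linalg.norm(x - M[0]))                  # retrieval error: small
```

Under these assumptions the dynamics retrieve (converge near) the closest stored memory, and the energy decreases monotonically along the trajectory, consistent with the EM-via-gradient-ascent reading. The pre-layer-norm / von Mises-Fisher variant described in the abstract would additionally project `x` back to the hypersphere after each step.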
Submission Number: 10