Keywords: sparse attention, modern Hopfield networks, Fenchel-Young losses, Tsallis entropies
TL;DR: We propose new Hopfield energy functions whose update rules are sparse attention transformations, allowing exact convergence of Hopfield updates to single memory patterns and fewer metastable states.
Abstract: Ramsauer et al. (2021) recently pointed out a connection between modern Hopfield networks and attention heads in transformers. In this paper, we extend their framework to a broader family of energy functions which can be written as the difference of a quadratic regularizer and a Fenchel-Young loss (Blondel et al., 2020), parametrized by a generalized negentropy function $\Omega$. When $\Omega$ is a Tsallis negentropy, the resulting update rules are end-to-end differentiable sparse transformations, establishing a new link to adaptively sparse transformers (Correia et al., 2019) and allowing exact convergence to single memory patterns. Experiments on simulated data show that the resulting networks avoid metastable states more often than their dense (softmax-based) counterparts.
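To illustrate the idea, below is a minimal sketch (not taken from the paper) of a Hopfield retrieval step in which the softmax of Ramsauer et al.'s update $q \leftarrow X\,\mathrm{softmax}(\beta X^\top q)$ is replaced by a sparse transformation such as sparsemax, one member of the Tsallis (α-entmax) family; the function names, shapes, and the single-step loop are illustrative assumptions, and the paper's exact energy functions and update rules should be consulted for the precise formulation.

```python
import numpy as np

def sparsemax(z):
    # Sparsemax (Martins & Astudillo, 2016): Euclidean projection of z onto the
    # probability simplex; returns a distribution that can have exact zeros.
    z_sorted = np.sort(z)[::-1]
    cumsum = np.cumsum(z_sorted)
    k = np.arange(1, len(z) + 1)
    support = 1 + k * z_sorted > cumsum          # coordinates kept in the support
    k_max = k[support][-1]
    tau = (cumsum[k_max - 1] - 1.0) / k_max       # threshold shared by the support
    return np.maximum(z - tau, 0.0)

def sparse_hopfield_update(q, X, beta=1.0, num_steps=1):
    # Illustrative sparse Hopfield retrieval (hypothetical helper, not the paper's code).
    # X: (d, N) matrix of stored patterns as columns; q: (d,) query/state vector.
    # With a sparse transformation in place of softmax, the retrieved state is an
    # exact convex combination of a few patterns, possibly a single one.
    for _ in range(num_steps):
        p = sparsemax(beta * X.T @ q)   # sparse attention weights over stored patterns
        q = X @ p                       # updated state: sparse mixture of patterns
    return q, p

# Example: with a large beta and a query close to one pattern, the weights p
# typically collapse onto that single pattern, i.e. exact retrieval.
rng = np.random.default_rng(0)
X = rng.normal(size=(16, 5))
q0 = X[:, 2] + 0.1 * rng.normal(size=16)
q, p = sparse_hopfield_update(q0, X, beta=4.0)
print(p)
```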
Submission Number: 26