Keywords: associative memory, energy-based models, Hopfield network, generative AI
TL;DR: We show that the Energy Transformer can serve as a generative model, but requires architectural refinements to become competitive.
Abstract: Modern generative approaches like Equilibrium Matching (EqM) train models to approximate energy gradients, yet they typically rely on unconstrained architectures that lack intrinsic energy guarantees. We address this by using the EqM objective to train the Energy Transformer (ET), a Modern Hopfield Network whose forward pass explicitly performs gradient descent on a global energy function. This combination yields a Generative Associative Memory in which the architecture strictly enforces the conservative vector field required by the training objective. We evaluate this framework on CIFAR-10, systematically exploring the trade-offs between architectural depth (stacked blocks) and temporal recurrence (iterative refinement within blocks). While a baseline single-layer model demonstrates feasibility (FID 79.72), we find that scaling to multi-block configurations markedly improves generation quality (FID 28.56), suggesting that hierarchical energy landscapes are essential for capturing complex image distributions. We further ablate design choices such as 2D positional encodings, energy minimization timesteps, and guidance strategies, offering a comprehensive analysis of how explicit associative memories can be scaled to competitive generative modeling.
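To make the core mechanism concrete, the sketch below (not the paper's code) shows a toy model whose forward pass is explicit gradient descent on a learned scalar energy, trained with a simplified EqM-style gradient-matching loss. The MLP energy network, step counts, and target vector field are illustrative assumptions, not the actual ET architecture or the exact EqM objective.

```python
# Minimal sketch (assumptions, not the paper's code): a toy model whose
# "forward pass" is gradient descent on a learned scalar energy, trained
# with a simplified EqM-style loss that regresses the energy gradient
# onto a target vector field.
import torch
import torch.nn as nn

class ToyEnergyModel(nn.Module):
    def __init__(self, dim: int = 2, hidden: int = 128):
        super().__init__()
        # Illustrative MLP energy; the paper uses the Energy Transformer.
        self.net = nn.Sequential(
            nn.Linear(dim, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, 1),
        )

    def energy(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x).squeeze(-1)  # scalar energy per sample

    def grad_energy(self, x: torch.Tensor) -> torch.Tensor:
        # The field is the exact gradient of a scalar energy, so it is
        # conservative by construction (the architectural guarantee above).
        x = x.detach().requires_grad_(True)
        e = self.energy(x).sum()
        return torch.autograd.grad(e, x, create_graph=True)[0]

    @torch.enable_grad()
    def sample(self, x: torch.Tensor, steps: int = 50, lr: float = 0.1) -> torch.Tensor:
        # Generation = energy minimization: iterated gradient descent steps.
        for _ in range(steps):
            x = (x - lr * self.grad_energy(x)).detach()
        return x

def eqm_style_loss(model: ToyEnergyModel, x_data: torch.Tensor) -> torch.Tensor:
    # Simplified EqM-flavoured objective (illustrative target field):
    # interpolate data with noise and match grad E to a field pointing from
    # data toward noise, so descending the energy moves samples toward data.
    noise = torch.randn_like(x_data)
    t = torch.rand(x_data.shape[0], 1)
    x_t = (1 - t) * x_data + t * noise
    target = noise - x_data
    return ((model.grad_energy(x_t) - target) ** 2).mean()

if __name__ == "__main__":
    model = ToyEnergyModel()
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    x_data = torch.randn(256, 2)  # placeholder "data"
    loss = eqm_style_loss(model, x_data)
    opt.zero_grad()
    loss.backward()
    opt.step()
    samples = model.sample(torch.randn(16, 2))  # descend the energy from noise
```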
Submission Number: 38