Recasting Transformer Layers as Energy Models

ICLR 2026 Conference Submission 18341 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: transformers, energy-based models, layer design, language modeling
Abstract: Foundation models rely on sequence-to-sequence mappings parameterized by neural networks, and the design space of these layers continues to expand. Transformer layers remain the dominant choice due to their strong performance and high parallelism, though many design decisions are still empirically based. We introduce causal energy minimization (CEM), a framework that interprets each transformer layer as an algorithm for solving an energy minimization problem with causal structure. This perspective separates the mathematical interpretation of a layer from its numerical realization, offering a unifying lens for layer design and motivating principled architectural innovations. Within CEM, multi-head attention emerges as a gradient step on an interaction energy under the weights sharing constraint, while gated MLP correspond to element-wise energies. The form of transformer components within CEM suggests a weight-sharing scheme in both attention and MLP blocks: we show that this yields parameter-efficient layers with negligible performance loss. Further, the CEM interpretation suggests appealing extensions to the transformer architecture: pre-conditioner-matrices for residual connections, diagonal matrices for inter-token-distances in attention, and multiple gradient-steps (a form of layer re-use) for both attention and MLP blocks. We show that these ideas that occur naturally in CEM lead to improvements on language modelling tasks, positioning CEM as a blueprint for principled and extensible architecture design.
Primary Area: foundation or frontier models, including LLMs
Submission Number: 18341