Transformers as Intrinsic Optimizers: Forward Inference through the Energy Principle

18 Sept 2025 (modified: 03 Dec 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: Transformer, energy principle, attention mechanism
TL;DR: This paper revisits the energy principle as a lens for understanding attention-based Transformer models.
Abstract: Transformers have demonstrated strong adaptability across a wide range of tasks and have become the backbone of modern Large Language Models (LLMs). However, their underlying mechanisms remain open to further exploration. The energy-based perspective has long provided a valuable principle for understanding neural computation. In this paper, we revisit the energy principle as a framework for understanding attention-based Transformers. Within the proposed framework, standard attention can be viewed as a special case of minimizing the Helmholtz free energy when the energy function takes the form of an elastic potential energy, with residual connections ensuring that this optimization proceeds incrementally. Building on this connection, we cast the forward pass and the parameter updates during model training as a unified alternating optimization, in which parameter updates follow conventional training objectives while the model architecture locally optimizes an energy-based regularization term. Furthermore, we extend the first-order energy update of standard attention to a second-order form based on Newton's method, which introduces a covariance matrix to precondition the update directions of tokens. We also extend this analysis to the multi-head case, where energy minimization is performed across multiple low-dimensional subspaces. Our experiments provide preliminary support for the potential of the energy-based framework as a tool for designing attention mechanisms.
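
As a concrete illustration of the claimed connection (a minimal sketch, not taken from the paper), the snippet below treats one attention-plus-residual update as a single gradient step on a Helmholtz-style free energy built from an elastic pairwise energy E_ij = 0.5 * ||W_q x_i - W_k x_j||^2. The projection matrices W_q and W_k, the inverse temperature beta, and the step size eta are illustrative assumptions rather than the authors' exact formulation.

```python
# Illustrative sketch (not the authors' code): attention-with-residual viewed as one
# gradient step on a Helmholtz-style free energy with an elastic pairwise energy.
# W_q, W_k, beta, and eta are hypothetical choices made for illustration only.
import numpy as np

rng = np.random.default_rng(0)
n, d = 6, 8                          # number of tokens, model dimension
X = rng.normal(size=(n, d))          # token states
W_q = rng.normal(size=(d, d)) / np.sqrt(d)
W_k = rng.normal(size=(d, d)) / np.sqrt(d)
beta = 1.0 / np.sqrt(d)              # inverse temperature (analogous to 1/sqrt(d) scaling)

def free_energy(X):
    """F_i = -(1/beta) * log sum_j exp(-beta * E_ij), with
    elastic pairwise energy E_ij = 0.5 * ||W_q x_i - W_k x_j||^2."""
    Q, K = X @ W_q, X @ W_k
    E = 0.5 * (np.sum(Q**2, 1)[:, None] + np.sum(K**2, 1)[None, :] - 2 * Q @ K.T)
    m = (-beta * E).max(axis=1, keepdims=True)            # stable log-sum-exp
    F = -(m.squeeze(1) + np.log(np.exp(-beta * E - m).sum(axis=1))) / beta
    return F.sum(), E

def energy_descent_step(X, eta=0.1):
    """One gradient step on the free energy w.r.t. the query side of each x_i.
    The update is a softmax(-beta * E)-weighted combination of key vectors,
    i.e. an attention-like update, and keeping the original X is the residual."""
    _, E = free_energy(X)
    A = np.exp(-beta * (E - E.min(axis=1, keepdims=True)))
    A = A / A.sum(axis=1, keepdims=True)                  # attention weights
    Q, K = X @ W_q, X @ W_k
    # dF_i/dq_i = q_i - sum_j A_ij k_j ; chain rule through q_i = W_q^T x_i
    grad_X = (Q - A @ K) @ W_q.T
    return X - eta * grad_X                               # residual + attention-like term

F_before, _ = free_energy(X)
F_after, _ = free_energy(energy_descent_step(X))
print(f"free energy before: {F_before:.4f}, after one step: {F_after:.4f}")
```

Under this reading, stacking such layers corresponds to incremental minimization of the free energy, and replacing the plain gradient step with a Newton-style, covariance-preconditioned step would yield the second-order variant described in the abstract.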
Primary Area: interpretability and explainable AI
Submission Number: 11501