Understanding Transformer Architecture through Continuous Dynamics: A Partial Differential Equation Perspective

ICLR 2026 Conference Submission 5950 Authors

15 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Transformer Architecture, Information Bottleneck, Partial Differential Equation (PDE)
Abstract: The Transformer architecture has revolutionized artificial intelligence, yet a principled theoretical understanding of its internal mechanisms remains elusive. This paper introduces a novel analytical framework that reconceptualizes the Transformer’s discrete, layered structure as a continuous spatiotemporal dynamical system governed by a master Partial Differential Equation (PDE). Within this paradigm, we map core architectural components to distinct mathematical operators: self-attention as a non-local interaction, the feed-forward network as a local reaction, and, critically, residual connections and layer normalization as indispensable stabilization mechanisms. We do not propose a new model, but rather employ the PDE system as a theoretical probe to analyze the mathematical necessity of these components. By comparing a standard Transformer with a PDE simulator that lacks explicit stabilizers, our experiments provide compelling empirical evidence for our central thesis. We demonstrate that without residual connections, the system suffers from catastrophic representational drift, while the absence of layer normalization leads to unstable, explosive training dynamics. Our findings reveal that these seemingly heuristic “tricks” are, in fact, fundamental mathematical stabilizers required to tame an otherwise powerful but inherently unstable continuous system. This work offers a first-principles explanation for the Transformer’s design and establishes a new paradigm for analyzing deep neural networks through the lens of continuous dynamics.
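To make the operator mapping concrete, the following is a minimal PyTorch sketch of the PDE reading of a Transformer block, assuming the common interpretation that a residual update h + F(h) acts as an explicit Euler step of dh/dt = F(h), with self-attention as the non-local interaction term and the feed-forward network as the local reaction term. The module structure, the step size `dt`, and the unstabilized variant are illustrative assumptions for exposition, not the paper's exact formulation.

```python
# Sketch (not the authors' formulation): depth as continuous "time", one block
# as an Euler step of dh/dt = Attn(LN(h)) + FFN(LN(h)). The residual "+ h" and
# the LayerNorms play the role of the stabilizers discussed in the abstract.
import torch
import torch.nn as nn


class PDEStyleBlock(nn.Module):
    def __init__(self, d_model: int, n_heads: int, dt: float = 1.0,
                 stabilized: bool = True):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dt = dt
        self.stabilized = stabilized

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        if self.stabilized:
            # Euler step: residual connection + layer normalization.
            u = self.norm1(h)
            a, _ = self.attn(u, u, u)          # non-local interaction term
            h = h + self.dt * a
            h = h + self.dt * self.ffn(self.norm2(h))  # local reaction term
        else:
            # Unstabilized analogue: no residual path, no normalization.
            # Iterating h_{k+1} = Attn(h_k) + FFN(h_k) tends to drift or explode.
            a, _ = self.attn(h, h, h)
            h = a + self.ffn(h)
        return h


if __name__ == "__main__":
    x = torch.randn(2, 16, 64)  # (batch, sequence length, d_model)
    stable = nn.Sequential(*[PDEStyleBlock(64, 4) for _ in range(8)])
    unstable = nn.Sequential(*[PDEStyleBlock(64, 4, stabilized=False) for _ in range(8)])
    print("stabilized output norm:  ", stable(x).norm().item())
    print("unstabilized output norm:", unstable(x).norm().item())
```

Stacking many such blocks and comparing the two variants gives a rough, qualitative sense of the stability gap the abstract describes; the paper's actual experiments use a dedicated PDE simulator rather than this toy comparison.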
Primary Area: foundation or frontier models, including LLMs
Submission Number: 5950