Keywords: Continuous Dynamical Systems
Abstract: We propose PDE-Transformer, a novel sequence-modeling paradigm that casts the forward pass of a Transformer as the numerical discretization of a continuous reaction–diffusion system derived from a variational energy functional. In our framework, token embeddings evolve under a partial differential equation whose nonlocal integral term models self-attention, whose local reaction term models the feed-forward layers, whose diffusion term encodes positional smoothing, and whose stability-control term corresponds to layer normalization. From this unifying perspective, we design an Adaptive PDE Diffusion Layer: an efficient, learnable finite-difference stencil that enforces local smoothness in feature space with linear time complexity and complements the global routing of self-attention. Through a systematic theoretical analysis based on four pillars (stability, diffusion geometry, multi-scale dynamics, and component coupling), we derive principled guidelines for integrating the PDE layer at seven candidate points in the Transformer. Empirically, on the Long Range Arena benchmark, placing the layer immediately after the embedding yields an average accuracy gain of 4.1 percentage points over a strong baseline, and an adaptive multi-scale variant delivers further improvements. Our work thus offers a principled, lightweight mechanism for strengthening long-range dependency modeling by harmonizing continuous PDE smoothing with discrete self-attention.
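To make the described component concrete, the sketch below shows one plausible realization of a learnable finite-difference diffusion layer with linear cost in sequence length, implemented as a depthwise convolution over tokens followed by an explicit Euler update. This is a minimal illustration under our own assumptions, not the submission's actual implementation; the class name, stencil width, and step-size parameterization are hypothetical.

```python
import torch
import torch.nn as nn

class AdaptivePDEDiffusionLayer(nn.Module):
    """Hypothetical sketch of a learnable finite-difference diffusion step.

    A depthwise 1D convolution along the sequence axis acts as a per-channel
    stencil (cost linear in sequence length); a learnable per-channel step
    size plays the role of a diffusion coefficient. Details are illustrative.
    """

    def __init__(self, d_model: int, stencil_size: int = 3):
        super().__init__()
        # Depthwise convolution: one small stencil per embedding channel.
        self.stencil = nn.Conv1d(
            d_model, d_model, kernel_size=stencil_size,
            padding=stencil_size // 2, groups=d_model, bias=False)
        # Learnable step size (diffusion coefficient) per channel.
        self.alpha = nn.Parameter(torch.zeros(d_model))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model); Conv1d expects (batch, channels, len).
        delta = self.stencil(x.transpose(1, 2)).transpose(1, 2)
        # One explicit Euler step: u_{t+1} = u_t + alpha * L(u_t),
        # where L is the learned stencil (a discrete diffusion operator).
        return x + self.alpha * delta
```

Placed immediately after the token embedding (the configuration the abstract reports as strongest), such a layer smooths features locally before self-attention performs global routing.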
Primary Area: foundation or frontier models, including LLMs
Submission Number: 5962