Transformers as Optimal Transport: Stability, Geometry, and Gauge Symmetry

15 Sept 2025 (modified: 12 Feb 2026) | ICLR 2026 Conference Desk Rejected Submission | Readers: Everyone | License: CC BY 4.0
Keywords: Optimal Transport, Wasserstein Gradient Flow, Transformer Attention, Representation Locking, Gauge Symmetry, Stability Analysis
TL;DR: We formulate transformer attention as semi-relaxed entropic OT and derive stability and curvature bounds, together with a variational inequality linking depth to Wasserstein gradient flow. Gauge-invariant GPT-2 experiments validate the theory.
Abstract: Self-attention is row-wise entropic optimal transport: masked softmax exactly solves an independent OT problem on each query's support with unit entropic regularization (ε=1). This is a precise mathematical equivalence, not an approximation. It yields a compositional stability theory via a global ℓ∞→ℓ₁ Lipschitz bound across heads, residuals, and LayerNorm, producing a conservative drift budget and explaining representation locking through local saturation when δ(P)→0. We introduce a gauge-invariant coarse Ricci curvature with τ-dependent bounds linking temperature and key scale to contraction, and show that depth behaves as a Wasserstein gradient flow via an evolution variational inequality. Empirically, on GPT-2 variants, measured drift sits well below the theoretical budgets (tightness ratio ≈ 0.043), locking occurs in ~10% of samples (TV < 10⁻¹⁰), Sinkhorn W₂ concentrates in mid-depth, and curvature gaps tighten with larger τ or smaller key scale, as predicted. We prove that depth cannot collapse: compositions generically lack single-layer representations with the same key dimension. We report extrinsic Euclidean quantities in a declared canonical gauge. The framework provides actionable design principles for temperature, key scaling, and early exit while organizing attention into a coherent geometric structure.
Primary Area: foundation or frontier models, including LLMs
Submission Number: 5326
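
The abstract's central claim, that masked softmax solves a per-query entropic OT problem with ε=1, can be checked numerically for a single attention row. The sketch below is illustrative only and is not taken from the submission: it assumes the transport cost is the negated attention score, that the only constraint is that each row sums to one on the unmasked support, and the names `masked_softmax` and `entropic_objective` are hypothetical. Under those assumptions the objective ⟨p, c⟩ − εH(p) attains the value −ε·logsumexp(scores/ε) at the softmax row, and every other feasible row scores higher by exactly its KL divergence from the softmax row.

```python
import math

def masked_softmax(scores, mask):
    """Softmax over positions where mask is True; exact zeros elsewhere."""
    allowed = [s for s, m in zip(scores, mask) if m]
    top = max(allowed)  # subtract the max for numerical stability
    weights = [math.exp(s - top) if m else 0.0 for s, m in zip(scores, mask)]
    total = sum(weights)
    return [w / total for w in weights]

def entropic_objective(p, scores, eps=1.0):
    """<p, cost> - eps * H(p), with cost = -scores (an assumed convention)."""
    transport = -sum(pj * sj for pj, sj in zip(p, scores) if pj > 0)
    neg_entropy = sum(pj * math.log(pj) for pj in p if pj > 0)
    return transport + eps * neg_entropy

# One query row with four keys; the last key is masked out (e.g. causal mask).
scores = [2.0, -1.0, 0.5, 3.0]
mask = [True, True, True, False]

p_star = masked_softmax(scores, mask)
best = entropic_objective(p_star, scores)

# Closed-form optimum for eps = 1: minus the log-sum-exp over the support.
lse = math.log(sum(math.exp(s) for s, m in zip(scores, mask) if m))

# Other feasible rows: each is a distribution supported on the unmasked keys.
uniform = [1 / 3, 1 / 3, 1 / 3, 0.0]
one_hot = [1.0, 0.0, 0.0, 0.0]
mixture = [0.5 * a + 0.5 * b for a, b in zip(p_star, uniform)]
candidates = [uniform, one_hot, mixture]

# Each gap equals KL(q || p_star) for eps = 1, so it is non-negative.
gaps = [entropic_objective(q, scores) - best for q in candidates]
print("softmax row:", p_star)
print("objective gaps of other feasible rows:", gaps)
```

Because only the row-sum constraint is imposed, this corresponds to the semi-relaxed setting named in the TL;DR; no Sinkhorn iteration is needed, since the single-marginal problem has the closed-form softmax solution.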