Keywords: Optimal Transport, Wasserstein Gradient Flow, Transformer Attention, Representation Locking, Gauge Symmetry, Stability Analysis
TL;DR: We formulate transformer attention as semi-relaxed entropic OT and derive stability and curvature bounds, together with a variational inequality linking depth to Wasserstein gradient flow. Gauge-invariant experiments on GPT-2 support the theory.
Abstract: Self-attention is row-wise entropic optimal transport: masked softmax
exactly solves independent OT problems on each query's support with unit
entropic regularization (ε=1). This is a precise mathematical equivalence,
not an approximation. It yields a compositional stability theory
via a global ℓ∞→ℓ₁ Lipschitz bound across heads, residuals, and LayerNorm,
producing a conservative drift budget and explaining representation locking
through local saturation when δ(P)→0. We introduce gauge-invariant coarse
Ricci curvature with τ-dependent bounds linking temperature and key scale
to contraction, and show depth behaves as Wasserstein gradient flow via
an evolution variational inequality. Empirically on GPT-2 variants,
measured drift sits well below theoretical budgets (tightness ratio ≈ 0.043),
locking occurs in ~10% of samples (TV <10⁻¹⁰), Sinkhorn W₂ concentrates
in mid-depth, and curvature gaps tighten with larger τ or smaller key
scale as predicted. We prove depth cannot collapse: compositions generically
lack single-layer representations with the same key dimension. We report
extrinsic Euclidean quantities in a declared canonical gauge. The framework
provides actionable design principles for temperature, key scaling, and
early exit while organizing attention into a coherent geometric structure.
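
The row-wise equivalence stated at the start of the abstract can be checked numerically. The sketch below is illustrative only and is not taken from the paper. It assumes the transport cost for one query is the negative attention score, that the only constraint is unit mass on the unmasked keys (the semi-relaxed setting, with no column marginal), and that ε=1. The names `masked_softmax` and `entropic_objective` are hypothetical helpers. Under those assumptions, the masked softmax row should minimize ⟨p, c⟩ + ε Σ p log p over the simplex on the query's support, with optimal value equal to minus the log-sum-exp of the unmasked scores.

```python
import numpy as np

def masked_softmax(scores, mask):
    """Row-wise softmax restricted to positions where mask is True.

    Assumes every row has at least one unmasked position.
    """
    s = np.where(mask, scores, -np.inf)
    s = s - s.max(axis=-1, keepdims=True)  # shift for numerical stability
    w = np.exp(s)
    return w / w.sum(axis=-1, keepdims=True)

def entropic_objective(p, cost, eps=1.0):
    """<p, cost> + eps * sum_j p_j log p_j, using the convention 0 log 0 = 0."""
    safe = np.where(p > 0, p, 1.0)  # log(1) = 0 where p is zero
    return float((p * cost).sum() + eps * (p * np.log(safe)).sum())

rng = np.random.default_rng(0)
n_keys = 6
scores = rng.normal(size=n_keys)                 # one query's raw scores
mask = np.array([True, True, True, True, False, False])
cost = -scores                                   # assumed cost: negative score

# Candidate optimum: the masked softmax row.
p_star = masked_softmax(scores[None, :], mask[None, :])[0]
f_star = entropic_objective(p_star, cost)

# Compare against random feasible plans supported on the same keys.
gaps = []
for _ in range(1000):
    q = np.zeros(n_keys)
    q[mask] = rng.dirichlet(np.ones(mask.sum()))
    gaps.append(entropic_objective(q, cost) - f_star)
min_gap = min(gaps)

# Closed-form optimal value for eps = 1: minus log-sum-exp over the support.
closed_form = -np.log(np.exp(scores[mask]).sum())

print("smallest objective gap over random plans:", min_gap)
print("objective at softmax vs closed form:", f_star, closed_form)
```

The gap between any feasible plan and the softmax row equals the KL divergence from that plan to the softmax row, so it cannot be negative. A random search like this is only a sanity check of the claim, not a proof.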
Primary Area: foundation or frontier models, including LLMs
Submission Number: 5326