From Attention to Diffusion: A Unified Entropic Optimal Transport View

ICLR 2026 Conference Submission 19055 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Optimal Transport, Transformer Attention, Diffusion LLM, Probability-flow ODE, Schrödinger Bridge alignment, Neural ODE
TL;DR: We prove attention and diffusion are discretizations of the same entropy-regularized OT flow, yielding a unified theory, finite-depth bounds, and actionable diagnostics
Abstract: We show that transformer attention and diffusion models are discretizations of the same entropy-regularized optimal transport (OT) flow. A single attention layer is a KL-proximal (JKO/mirror) step in an OT potential; stacking layers yields probability paths that converge to a probability-flow ODE (PF-ODE) on the simplex. Our construction uses a causal, semi-relaxed entropic OT (EOT) formulation that preserves attention masking while retaining OT geometry. We derive a finite-depth error bound controlled by a budget Ξ_L (quantifying continuum validity) and prove that stacked attention weakly approximates time-inhomogeneous, anisotropic reverse diffusions, with an error that separates time discretization, logit variation, and optional degeneracy regularization. Geometrically, we characterize exact Schrödinger Bridge (SB) alignment via a rotational energy ℛ that vanishes if and only if the path is an SB, and serves as a practical diagnostic otherwise. The framework yields testable predictions: (i) the continuum approximation is accurate when Ξ_L is small; (ii) depth exhibits diminishing returns beyond a threshold set by contraction and step size; and (iii) lower ℛ correlates with improved generations. We validate these predictions with a diagnostic suite (P0–P4): BV/continuity gating (with abstention on failure), PF-ODE adequacy, curvature/locking geometry, and SB energy. Evidence spans three tracks: Transformers (core diagnostics), diffusion LLMs (dLLM; late-window stability certificate), and a compact image diffusion model (parity and first-order weak-error behavior). These insights motivate mobility-aware temperature scheduling and certified early exit, conserving depth while preserving transport geometry.
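To make the first claim of the abstract concrete: over the probability simplex, the minimizer of a linear cost plus an entropic penalty, argmin_p −⟨p, z⟩ + ε·KL(p‖u), is the softmax of z/ε, so one masked attention row can be read as a single semi-relaxed EOT step. The minimal NumPy sketch below checks this numerically; the function name attention_row_as_eot, the uniform prior u, and the temperature ε are illustrative assumptions, not the authors' construction.

import numpy as np

def attention_row_as_eot(logits, mask=None, eps=1.0):
    # Semi-relaxed entropic-OT step on the simplex:
    #   argmin_p  -<p, logits> + eps * KL(p || uniform)
    # has the closed form softmax(logits / eps); a causal mask
    # restricts the support, matching masked attention.
    z = np.asarray(logits, dtype=float) / eps
    if mask is not None:
        z = np.where(mask, z, -np.inf)   # masked keys get exactly zero weight
    z = z - z.max()                      # stabilize the exponential
    w = np.exp(z)
    return w / w.sum()

# Sanity check: the EOT solution coincides with ordinary masked softmax attention.
rng = np.random.default_rng(0)
scores = rng.normal(size=5)
causal = np.array([True, True, True, False, False])
p = attention_row_as_eot(scores, mask=causal)
assert np.isclose(p.sum(), 1.0) and np.all(p[~causal] == 0.0)

Under this reading, the temperature eps plays the role of the entropic regularizer, which is what makes a mobility-aware temperature schedule (mentioned at the end of the abstract) a natural lever on the transport geometry.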
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Submission Number: 19055