ESPFormer: Doubly-Stochastic Attention with Expected Sliced Transport Plans

Published: 01 May 2025 · Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY 4.0
TL;DR: A novel Doubly-Stochastic Attention mechanism using Expected Sliced Transport Plans offers faster, parallelizable computations with adaptive priors and competitive performance.
Abstract: While self-attention has been instrumental in the success of Transformers, it can lead to over-concentration on a few tokens during training, resulting in suboptimal information flow. Enforcing doubly-stochastic constraints in attention matrices has been shown to improve structure and balance in attention distributions. However, existing methods rely on iterative Sinkhorn normalization, which is computationally costly. In this paper, we introduce a novel, fully parallelizable doubly-stochastic attention mechanism based on sliced optimal transport, leveraging Expected Sliced Transport Plans (ESP). Unlike prior approaches, our method enforces double stochasticity without iterative Sinkhorn normalization, significantly enhancing efficiency. To ensure differentiability, we incorporate a temperature-based soft sorting technique, enabling seamless integration into deep learning models. Experiments across multiple benchmark datasets, including image classification, point cloud classification, sentiment analysis, and neural machine translation, demonstrate that our enhanced attention regularization consistently improves performance across diverse applications. Our implementation code can be found at https://github.com/dariansal/ESPFormer.
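To make the mechanism described in the abstract concrete, below is a minimal, illustrative sketch of sliced-transport attention with temperature-based soft sorting. It is not the released ESPFormer implementation (see the GitHub repository for that); the function names `esp_attention` and `soft_sort`, and the defaults `n_slices` and `tau`, are assumptions made for this example. With hard sorting, each per-slice plan is a permutation matrix, so their average is exactly doubly stochastic; the soft-sorting relaxation trades exactness for differentiability.

```python
# Illustrative sketch only; not the authors' released code.
import torch
import torch.nn.functional as F

def soft_sort(s, tau=0.1):
    """Temperature-based soft permutation that (approximately) sorts the
    1-D scores `s`. Rows approach one-hot indicators as tau -> 0."""
    s = s.unsqueeze(-1)                                  # (n, 1)
    s_sorted = s.sort(dim=0).values                      # (n, 1) ascending
    pairwise = -(s_sorted - s.transpose(0, 1)).abs()     # (n, n) similarity of ranks to inputs
    return F.softmax(pairwise / tau, dim=-1)             # row-stochastic soft permutation

def esp_attention(Q, K, n_slices=16, tau=0.1):
    """Sketch of ESP-style attention: project queries and keys onto random
    1-D slices, soft-sort each projection, and average the induced
    rank-matching transport plans over slices."""
    n, d = Q.shape
    plans = []
    for _ in range(n_slices):
        theta = F.normalize(torch.randn(d), dim=0)       # random slicing direction
        Pq = soft_sort(Q @ theta, tau)                   # (n, n) sorts projected queries
        Pk = soft_sort(K @ theta, tau)                   # (n, n) sorts projected keys
        plans.append(Pq.transpose(0, 1) @ Pk)            # match i-th ranked query to i-th ranked key
    return torch.stack(plans).mean(dim=0)                # averaged plan used as the attention matrix

# Usage: A = esp_attention(Q, K); output = A @ V
```

No iterative normalization appears anywhere in the sketch: each slice is processed independently, so the loop over slices can be batched and run fully in parallel, which is the efficiency argument made in the abstract.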
Lay Summary: We’ve found that modern “attention” in AI models—how they decide which pieces of data to focus on—can become overly concentrated on just a few inputs, causing the models to miss important context. Previous fixes forced a strict balance by running a slow, back-and-forth normalization routine. Our ESPFormer method instead uses a clever mathematical shortcut (expected sliced transport) to balance attention in one fully parallel step, and applies a gentle “soft sorting” trick so it fits seamlessly into regular training. The result is exactly balanced attention maps without the heavy iterative cost. When we tested ESPFormer on image recognition, 3D point-cloud classification, text sentiment analysis, and machine translation, it consistently improved accuracy and ran faster than earlier approaches. Our open-source code makes it easy for anyone to try and build on this more efficient, balanced attention mechanism.
Link To Code: https://github.com/dariansal/ESPFormer
Primary Area: Deep Learning->Attention Mechanisms
Keywords: Geometric deep learning, Optimal Transport, Attention mechanism, doubly-stochastic matrices
Submission Number: 12597