Keywords: attention, transformers, Markov chain
TL;DR: We interpret attention as a discrete-time Markov chain and demonstrate the effectiveness of this view on various downstream tasks.
Abstract: We introduce a new interpretation of the attention matrix as a discrete-time Markov chain. Our interpretation sheds light on common operations involving attention scores, such as selection, summation, and averaging, within a unified framework. It further extends them by considering indirect attention, propagated through the Markov chain, whereas previous studies model only immediate effects. Our key observation is that tokens linked to semantically similar regions form metastable states, i.e., regions where attention tends to concentrate, while noisy attention scores dissipate. Metastable states and their prevalence can be easily computed through simple matrix multiplication and eigenanalysis, respectively. Using these lightweight tools, we demonstrate state-of-the-art zero-shot segmentation. Lastly, we define TokenRank---the steady state vector of the Markov chain---which measures global token importance. We show that TokenRank enhances unconditional image generation, improving both quality (IS) and diversity (FID), and can also be incorporated into existing segmentation techniques to improve their performance on existing benchmarks. We believe our framework offers a fresh view of how tokens are attended to in modern visual transformers.
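The abstract describes three operations on the attention matrix: row-normalizing it into a Markov transition matrix, propagating indirect attention through multi-hop transitions, and computing TokenRank as the chain's steady state. Below is a minimal sketch of these ideas, not the authors' code; the function names, the use of power iteration, and the random stand-in for an attention matrix are illustrative assumptions.

```python
# Sketch: treating a softmax attention matrix as a discrete-time Markov chain.
import numpy as np

def attention_to_markov(attn: np.ndarray) -> np.ndarray:
    """Row-normalize the attention matrix so each row is a probability
    distribution over tokens, i.e., a row-stochastic transition matrix."""
    return attn / attn.sum(axis=-1, keepdims=True)

def indirect_attention(P: np.ndarray, n_steps: int) -> np.ndarray:
    """Indirect attention propagated through the chain: the n-step transition
    matrix P^n captures multi-hop token-to-token influence."""
    return np.linalg.matrix_power(P, n_steps)

def token_rank(P: np.ndarray, n_iters: int = 1000, tol: float = 1e-10) -> np.ndarray:
    """Steady-state (stationary) distribution of the chain via power iteration,
    standing in for the global token-importance score the abstract calls TokenRank."""
    n = P.shape[0]
    pi = np.full(n, 1.0 / n)        # start from the uniform distribution
    for _ in range(n_iters):
        pi_next = pi @ P            # one step of the chain
        if np.abs(pi_next - pi).max() < tol:
            return pi_next
        pi = pi_next
    return pi

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    scores = rng.random((6, 6))                 # stand-in for a 6-token attention map
    P = attention_to_markov(scores)
    print(indirect_attention(P, 3).round(3))    # 3-hop indirect attention
    print(token_rank(P).round(3))               # global token importance
```

Since softmax attention rows are positive and sum to one, the resulting chain is irreducible and aperiodic, so power iteration converges to a unique stationary vector.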
Primary Area: Deep learning (e.g., architectures, generative models, optimization for deep networks, foundation models, LLMs)
Submission Number: 6725