Attention with Markov: A Curious Case of Single-layer Transformers

Published: 24 Jun 2024 · Last Modified: 31 Jul 2024 · ICML 2024 MI Workshop Poster · CC BY 4.0
Keywords: Markov chains, Transformers, Optimization, Landscape
TL;DR: We theoretically characterize the loss landscape of single-layer transformers and show the existence of global minima (bigram) and bad local minima (unigram) contingent upon the specific data characteristics and the transformer architecture.
Abstract: In recent years, attention-based transformers have achieved tremendous success across a variety of disciplines, including natural language processing. To better understand the sequential modeling capabilities of transformers, there is growing interest in studying them with Markov input processes. While previous research has shown that transformers with two or more layers develop an induction head mechanism to estimate the bigram conditional distribution, we find a surprising empirical phenomenon: single-layer transformers can get stuck at local minima corresponding to unigrams. To explain this, we introduce a new framework for a principled theoretical and empirical analysis of transformers via Markov chains. Leveraging our framework, we theoretically characterize the loss landscape of single-layer transformers and show the existence of global minima (bigram) and bad local minima (unigram), contingent on the specific data characteristics and the transformer architecture. Further, we precisely characterize the regimes under which these local optima occur. Backed by experiments, we demonstrate that our theoretical findings are consistent with the empirical results. Finally, we outline several open problems in this arena. Code is available at \url{https://anonymous.4open.science/r/Attention-with-Markov-A617/}.
Submission Number: 118
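The bigram/unigram distinction in the abstract can be made concrete with a small numerical sketch. This is illustrative only and not the paper's code; the binary chain, the switching probabilities p and q, and all variable names are assumptions. For a first-order Markov source, a predictor that outputs the true transition kernel (the "bigram" solution) attains the chain's entropy rate, whereas a predictor that only outputs the stationary marginal (the "unigram" solution) incurs a strictly larger cross-entropy unless the chain is i.i.d.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative binary first-order Markov chain (assumed values, not from the paper):
# p = P(next = 1 | current = 0), q = P(next = 0 | current = 1).
p, q = 0.2, 0.3
P = np.array([[1 - p, p],
              [q, 1 - q]])          # true transition kernel (bigram statistics)
pi = np.array([q, p]) / (p + q)     # stationary distribution (unigram statistics)

# Sample a long sequence from the chain.
T = 100_000
x = np.empty(T, dtype=int)
x[0] = rng.choice(2, p=pi)
for t in range(1, T):
    x[t] = rng.choice(2, p=P[x[t - 1]])

# Next-token cross-entropy of the two predictors:
#  - "bigram": predicts P[x_t | x_{t-1}], i.e., the true conditional distribution
#  - "unigram": always predicts the marginal pi, ignoring the previous token
bigram_loss = -np.mean(np.log(P[x[:-1], x[1:]]))
unigram_loss = -np.mean(np.log(pi[x[1:]]))
print(f"bigram loss  ~ {bigram_loss:.4f}")   # approaches the entropy rate of the chain
print(f"unigram loss ~ {unigram_loss:.4f}")  # strictly larger unless p + q = 1 (i.i.d.)
```

The gap between the two printed losses is the quantity at stake in the paper's landscape analysis: the bigram predictor corresponds to the global minimum, while the unigram predictor is the suboptimal point at which single-layer transformers can get stuck.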