Probability Distributions Computed by Hard-Attention Transformers

ICLR 2026 Conference Submission 22229 Authors

20 Sept 2025 (modified: 08 Oct 2025), ICLR 2026 Conference Submission, CC BY 4.0
Keywords: expressivity, weighted automata, language models, transformers, linear temporal logic
TL;DR: We exactly characterize the probability distributions expressed by transformers when viewed as language recognizers and autoregressive models.
Abstract: Most expressivity results for transformers treat them as language recognizers (which accept or reject strings), rather than as they are used in practice: as language models, which generate strings autoregressively and probabilistically. Here, we characterize the probability distributions that transformer language models can express. We show that making transformer language recognizers autoregressive can sometimes increase their expressivity, and that making them probabilistic can break equivalences that hold in the non-probabilistic case. Our overall contribution is to tease apart what functions transformers are capable of expressing in their most common use case as language models.
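As a point of reference for the recognizer/language-model distinction above (a generic formulation, not a construction from this submission): an autoregressive language model over an alphabet $\Sigma$ assigns each string $w = w_1 \cdots w_n \in \Sigma^*$ the probability
$$p(w) \;=\; p(\mathrm{EOS} \mid w_{1:n}) \prod_{t=1}^{n} p(w_t \mid w_{1:t-1}),$$
where each conditional next-symbol probability comes from the model's output distribution; a language recognizer, by contrast, maps each string to a single accept/reject decision.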
Primary Area: other topics in machine learning (i.e., none of the above)
Submission Number: 22229