Transformers, parallel computation, and logarithmic depth

Published: 02 May 2024 · Last Modified: 25 Jun 2024 · ICML 2024 Spotlight · License: CC BY 4.0
Abstract: We show that a constant number of self-attention layers can efficiently simulate—and be simulated by—a constant number of communication rounds of *Massively Parallel Computation*. As a consequence, we show that logarithmic depth is sufficient for transformers to solve basic computational tasks that cannot be efficiently solved by several other neural sequence models and sub-quadratic transformer approximations. We thus establish parallelism as a key distinguishing property of transformers.
Submission Number: 8083
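
The logarithmic-depth claim in the abstract rests on the kind of parallel doubling that a few rounds of Massively Parallel Computation make possible. Below is a minimal NumPy sketch (not taken from the paper) of the classic pointer-doubling idea: a k-hop query over a pointer chain is resolved in O(log k) composition rounds, which is the depth pattern that the simulation results translate into attention layers. The function name `k_hop` and the cyclic example are illustrative assumptions, not the paper's construction.

```python
import numpy as np

def k_hop(pointers: np.ndarray, k: int) -> np.ndarray:
    """Compute the k-th iterate of a pointer map by repeated doubling.

    pointers[i] is the index that position i points to. The k-hop
    target of every position is found in O(log k) rounds, each of
    which composes the current map with itself -- the parallel-depth
    pattern behind logarithmic-depth sufficiency.
    """
    hop = np.arange(len(pointers))   # identity map: 0 hops so far
    step = pointers.copy()           # current power-of-two jump
    while k > 0:
        if k & 1:
            hop = step[hop]          # apply the current jump
        step = step[step]            # double the jump length
        k >>= 1
    return hop

# Example: a cyclic pointer chain of length 8.
ptrs = np.array([1, 2, 3, 4, 5, 6, 7, 0])
print(k_hop(ptrs, 5))  # every position advanced 5 steps around the cycle
```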
