Transformers, parallel computation, and logarithmic depth

Published: 02 May 2024 · Last Modified: 25 Jun 2024 · ICML 2024 Spotlight · CC BY 4.0
Abstract: We show that a constant number of self-attention layers can efficiently simulate, and be simulated by, a constant number of communication rounds of *Massively Parallel Computation*. As a consequence, we show that logarithmic depth is sufficient for transformers to solve basic computational tasks that cannot be efficiently solved by several other neural sequence models and sub-quadratic transformer approximations. We thus establish parallelism as a key distinguishing property of transformers.
Submission Number: 8083
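
A rough sketch of the depth–round correspondence stated in the abstract (an informal reading, not the paper's formal theorem; the round-count symbol $R(n)$ below is illustrative):

$$
\underbrace{1 \text{ MPC round}}_{\text{simulable by } O(1) \text{ attention layers}}
\;\Longrightarrow\;
R(n) \text{ MPC rounds} \;\leadsto\; \text{transformer depth } O(R(n)).
$$

In particular, if a task admits an MPC algorithm using $R(n) = O(\log n)$ rounds, composing the constant-layer simulation round by round yields a transformer of depth $O(\log n)$ for that task, which is the sense in which logarithmic depth suffices.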