Transformers, parallel computation, and logarithmic depth

Published: 02 May 2024 · Last Modified: 25 Jun 2024 · ICML 2024 Spotlight · CC BY 4.0
Abstract: We show that a constant number of self-attention layers can efficiently simulate, and be simulated by, a constant number of communication rounds of *Massively Parallel Computation*. As a consequence, we show that logarithmic depth is sufficient for transformers to solve basic computational tasks that cannot be efficiently solved by several other neural sequence models and sub-quadratic transformer approximations. We thus establish parallelism as a key distinguishing property of transformers.
Submission Number: 8083
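
A rough sketch of the depth–round correspondence stated in the abstract (an informal reading, not the paper's formal theorem; the round-count symbol $R(n)$ below is illustrative):

$$
\underbrace{1 \text{ MPC round}}_{\text{simulable by } O(1) \text{ attention layers}}
\;\Longrightarrow\;
R(n) \text{ MPC rounds} \;\leadsto\; \text{transformer depth } O(R(n)).
$$

In particular, if a task admits an MPC algorithm using $R(n) = O(\log n)$ rounds, composing the constant-layer simulation round by round yields a transformer of depth $O(\log n)$ for that task, which is the sense in which logarithmic depth suffices.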