A Little Depth Goes a Long Way: The Expressive Power of Log-Depth Transformers

Published: 11 Oct 2024, Last Modified: 10 Nov 2024, M3L Poster, CC BY 4.0
Keywords: transformers, expressivity
TL;DR: Transformers with log depth can express problems inexpressible by transformers with constant depth.
Abstract: Most analysis of transformer expressivity treats the depth (number of layers) of a model as a fixed constant and analyzes the kinds of problems such models can solve across inputs of unbounded length. In practice, however, the context length of a trained transformer model is bounded. Thus, a more pragmatic question is: *What kinds of computation can a transformer perform on inputs of bounded length?* We formalize this by studying highly uniform transformers whose depth can grow minimally with context length. In this regime, we show that transformers with depth $O(\log C)$ can, in fact, compute solutions to two important problems for inputs bounded by some maximum context length $C$, namely *simulating finite automata*, which relates to the ability to track state, and *graph connectivity*, which underlies multi-step reasoning. Notably, both of these problems have previously been proven to be asymptotically beyond the reach of fixed-depth transformers under standard complexity conjectures, yet empirically transformer models can successfully track state and perform multi-hop reasoning on short contexts. Our analysis thus explains how transformer models may rely on depth to feasibly solve, up to a bounded context length, problems that they cannot solve over arbitrarily long contexts. It makes actionable suggestions for practitioners on how to minimally scale the depth of a transformer to support reasoning over long contexts, and it also argues for dynamically unrolling depth as a more effective way of adding compute than increasing model dimension or adding a short chain of thought.
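Both problems named in the abstract admit a classic doubling argument, which gives some intuition for why $O(\log C)$ depth can suffice even when constant depth provably cannot (under standard conjectures). The Python sketch below is an assumed illustration of that intuition only, not the paper's transformer construction; the function names `simulate_automaton` and `is_connected` are hypothetical. Automaton simulation is reduced to composing per-step transition relations in a balanced binary tree of depth $\lceil \log_2 C \rceil$, and graph connectivity to repeatedly squaring a boolean adjacency matrix.

```python
import numpy as np

# Minimal sketch (assumed example, not the paper's construction) of the doubling
# argument behind O(log C) depth: both problems reduce to logarithmically many
# rounds of an associative combine step.


def simulate_automaton(step_matrices):
    """Compose per-symbol transition matrices of a finite automaton in a
    balanced binary tree; the tree has depth ceil(log2(C)) for C symbols.

    step_matrices[i][s, t] is True iff reading symbol i moves state s to t.
    """
    level = list(step_matrices)
    while len(level) > 1:                      # one tree level per iteration
        nxt = []
        for i in range(0, len(level) - 1, 2):  # compose adjacent pairs in order
            nxt.append((level[i].astype(int) @ level[i + 1].astype(int)) > 0)
        if len(level) % 2 == 1:                # odd leftover passes through
            nxt.append(level[-1])
        level = nxt
    return level[0]                            # start-to-end transition relation


def is_connected(adj, s, t):
    """Decide s-t connectivity by repeated boolean squaring of the adjacency
    matrix: ceil(log2(n)) squarings cover all paths of length up to n."""
    n = adj.shape[0]
    reach = adj | np.eye(n, dtype=bool)        # allow staying in place
    for _ in range(int(np.ceil(np.log2(max(n, 2))))):
        reach = (reach.astype(int) @ reach.astype(int)) > 0
    return bool(reach[s, t])


if __name__ == "__main__":
    # Tiny usage check: a 2-state parity automaton over 8 symbols, and a
    # 4-node path graph.
    flip = np.array([[False, True], [True, False]])
    print(simulate_automaton([flip] * 8))      # identity: 8 flips = even parity
    path = np.zeros((4, 4), dtype=bool)
    for i in range(3):
        path[i, i + 1] = path[i + 1, i] = True
    print(is_connected(path, 0, 3))            # True
```

In both sketches the number of sequential combine rounds, rather than the total work, plays the role of depth: the per-round pairwise composition or squaring is the kind of step a constant number of transformer layers could plausibly implement, so the overall depth requirement scales logarithmically with the context length.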
Is Neurips Submission: No
Submission Number: 82