Keywords: Transformers, Depth, Layers, Formal Language Theory, Temporal Logic, Expressivity
TL;DR: We prove precisely how deeper transformers (with appropriate rounding) become more expressive, and show that empirical behaviour tracks our theory.
Abstract: It has been observed that transformers with greater depth (that is, more layers) have more capabilities, but can we establish formally which capabilities are gained? We answer this question with a theoretical proof followed by an empirical study. First, we consider transformers that round to fixed precision except inside attention. We show that this subclass of transformers is expressively equivalent to the programming language $\textsf{C}$-$\textsf{RASP}$, and that this equivalence preserves depth. Second, we prove that deeper $\textsf{C}$-$\textsf{RASP}$ programs are more expressive than shallower ones, implying that deeper transformers are more expressive than shallower transformers (within the subclass mentioned above). The same result also holds for transformers with positional encodings (such as RoPE and ALiBi). These results are established by studying a temporal logic with counting operators that is equivalent to $\textsf{C}$-$\textsf{RASP}$. Finally, we provide empirical evidence that our theory predicts the depth required for transformers without positional encodings to length-generalize on a family of sequential dependency tasks.
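For intuition only (this sketch is not taken from the submission and does not use the paper's actual $\textsf{C}$-$\textsf{RASP}$ syntax), the flavor of computation that counting operators provide can be illustrated with running prefix counts. The hypothetical Python function `recognizes_count_language` below decides a simple counting language using nothing but such counts, which is the kind of per-position quantity a single counting layer makes available.

```python
# Illustrative sketch only: a Python analogue of a counting-style program.
# Hypothetical example, not the paper's C-RASP syntax; it uses only prefix
# counts, the primitive that counting operators supply at each position.

def recognizes_count_language(s: str) -> bool:
    """Accept strings over {a, b} with equally many a's and b's,
    where every prefix contains at least as many a's as b's."""
    count_a = 0  # running count of 'a' seen so far (a prefix count)
    count_b = 0  # running count of 'b' seen so far
    for ch in s:
        if ch == "a":
            count_a += 1
        elif ch == "b":
            count_b += 1
        else:
            return False  # symbol outside the alphabet
        if count_b > count_a:  # some prefix violates the counting constraint
            return False
    return count_a == count_b


if __name__ == "__main__":
    print(recognizes_count_language("aabb"))  # True
    print(recognizes_count_language("abab"))  # True
    print(recognizes_count_language("abba"))  # False: prefix 'abb' fails
```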
Supplementary Material: zip
Primary Area: Theory (e.g., control theory, learning theory, algorithmic game theory)
Submission Number: 25436