On Expressive Power of Looped Transformers: Theoretical Analysis and Enhancement via Timestep Encoding

Published: 01 May 2025 · Last Modified: 23 Jul 2025 · ICML 2025 poster · CC BY-NC-SA 4.0
TL;DR: We establish an approximation rate for the Looped Transformer and enhance its expressive power via timestep encoding.
Abstract: Looped Transformers provide advantages in parameter efficiency, computational capabilities, and generalization for reasoning tasks. However, their expressive power regarding function approximation remains underexplored. In this paper, we establish the approximation rate of Looped Transformers by defining the modulus of continuity for sequence-to-sequence functions. This analysis reveals a limitation specific to the looped architecture and motivates the incorporation of scaling parameters for each loop, conditioned on timestep encoding. Experiments validate the theoretical results, showing that increasing the number of loops enhances performance, with further gains achieved through timestep encoding.
Lay Summary: Transformers typically apply multiple layers once to the input, but what happens if we reuse the same layer multiple times? We explore how looping, a core concept in computer science, can benefit neural networks. In this work, we show a surprising result: even with a fixed number of parameters in a single layer, performance improves as we increase the number of loops. In contrast, conventional Transformers require increasing the parameter count to achieve better results. We further analyze a fundamental limitation of this looped design: after a point, adding more loops no longer helps. To address this, we propose a simple solution—injecting timestep information into each loop, which overcomes the barrier and boosts performance further. Our findings provide a deeper theoretical understanding of Looped Transformers and suggest a new modeling paradigm: achieving high performance through parameter-efficient architectures. This opens the door to more compact, generalizable, and resource-friendly models.
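To make the architecture concrete, here is a minimal sketch of the idea described above: a single Transformer block reused across loops, with per-loop scaling parameters produced from a timestep encoding by a small hypernetwork. This is not the authors' implementation (see the linked repository); names such as `LoopedBlock` and `timestep_embedding`, and the scaled-residual form of the update, are illustrative assumptions.

```python
# Hypothetical sketch of a Looped Transformer with timestep-conditioned scaling.
# Not the authors' implementation; details here are assumptions for illustration.
import math
import torch
import torch.nn as nn


def timestep_embedding(t: int, dim: int) -> torch.Tensor:
    """Sinusoidal encoding of the loop index t (assumed encoding scheme)."""
    half = dim // 2
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half) / half)
    angles = t * freqs
    return torch.cat([torch.sin(angles), torch.cos(angles)])


class LoopedBlock(nn.Module):
    """One shared Transformer block, reused at every loop iteration.

    A small hypernetwork maps the timestep encoding to per-loop scaling
    parameters that modulate the block's update, i.e. the "scaling parameters
    for each loop, conditioned on timestep encoding" from the abstract.
    """

    def __init__(self, d_model: int = 64, n_heads: int = 4, t_dim: int = 32):
        super().__init__()
        self.block = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model, batch_first=True
        )
        self.t_dim = t_dim
        self.hyper = nn.Linear(t_dim, d_model)  # timestep encoding -> scale

    def forward(self, x: torch.Tensor, t: int) -> torch.Tensor:
        scale = self.hyper(timestep_embedding(t, self.t_dim))  # (d_model,)
        return x + scale * (self.block(x) - x)  # scaled residual update


# Looping the same block: more loops, no additional block parameters.
x = torch.randn(2, 10, 64)   # (batch, sequence, d_model)
block = LoopedBlock()
for t in range(12):          # the number of loops is a free choice at inference
    x = block(x, t)
```

Note that without the `hyper` network every loop would apply an identical update, which is the kind of restriction the paper's analysis points to; conditioning the scale on the loop index is the proposed remedy.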
Link To Code: https://github.com/kevin671/tmlt
Primary Area: Deep Learning->Attention Mechanisms
Keywords: Transformer, Looped Transformer, Expressive power, Approximation Rate, HyperNetworks
Submission Number: 10057