On Expressive Power of Looped Transformers: Theoretical Analysis and Enhancement via Timestep Encoding

Published: 01 May 2025 · Last Modified: 23 Jul 2025 · ICML 2025 poster · CC BY-NC-SA 4.0
TL;DR: We establish an approximation rate for the Looped Transformer and enhance its expressive power via timestep encoding.
Abstract: Looped Transformers provide advantages in parameter efficiency, computational capabilities, and generalization for reasoning tasks. However, their expressive power regarding function approximation remains underexplored. In this paper, we establish the approximation rate of Looped Transformers by defining the modulus of continuity for sequence-to-sequence functions. This analysis reveals a limitation specific to the looped architecture and motivates the incorporation of scaling parameters for each loop, conditioned on timestep encoding. Experiments validate the theoretical results, showing that increasing the number of loops enhances performance, with further gains achieved through timestep encoding.
Lay Summary: Transformers typically apply multiple layers once to the input, but what happens if we reuse the same layer multiple times? We explore how looping, a core concept in computer science, can benefit neural networks. In this work, we show a surprising result: even with a fixed number of parameters in a single layer, performance improves as we increase the number of loops. In contrast, conventional Transformers require increasing the parameter count to achieve better results. We further analyze a fundamental limitation of this looped design: after a point, adding more loops no longer helps. To address this, we propose a simple solution—injecting timestep information into each loop, which overcomes the barrier and boosts performance further. Our findings provide a deeper theoretical understanding of Looped Transformers and suggest a new modeling paradigm: achieving high performance through parameter-efficient architectures. This opens the door to more compact, generalizable, and resource-friendly models.
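To make the architecture concrete, here is a minimal sketch of the idea described above: a single Transformer block reused across loops, with per-loop scaling parameters produced from a timestep encoding by a small hypernetwork. This is not the authors' implementation (see the linked repository); names such as `LoopedBlock` and `timestep_embedding`, and the scaled-residual form of the update, are illustrative assumptions.

```python
# Hypothetical sketch of a Looped Transformer with timestep-conditioned scaling.
# Not the authors' implementation; details here are assumptions for illustration.
import math
import torch
import torch.nn as nn


def timestep_embedding(t: int, dim: int) -> torch.Tensor:
    """Sinusoidal encoding of the loop index t (assumed encoding scheme)."""
    half = dim // 2
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half) / half)
    angles = t * freqs
    return torch.cat([torch.sin(angles), torch.cos(angles)])


class LoopedBlock(nn.Module):
    """One shared Transformer block, reused at every loop iteration.

    A small hypernetwork maps the timestep encoding to per-loop scaling
    parameters that modulate the block's update, i.e. the "scaling parameters
    for each loop, conditioned on timestep encoding" from the abstract.
    """

    def __init__(self, d_model: int = 64, n_heads: int = 4, t_dim: int = 32):
        super().__init__()
        self.block = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model, batch_first=True
        )
        self.t_dim = t_dim
        self.hyper = nn.Linear(t_dim, d_model)  # timestep encoding -> scale

    def forward(self, x: torch.Tensor, t: int) -> torch.Tensor:
        scale = self.hyper(timestep_embedding(t, self.t_dim))  # (d_model,)
        return x + scale * (self.block(x) - x)  # scaled residual update


# Looping the same block: more loops, no additional block parameters.
x = torch.randn(2, 10, 64)   # (batch, sequence, d_model)
block = LoopedBlock()
for t in range(12):          # the number of loops is a free choice at inference
    x = block(x, t)
```

Note that without the `hyper` network every loop would apply an identical update, which is the kind of restriction the paper's analysis points to; conditioning the scale on the loop index is the proposed remedy.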
Link To Code: https://github.com/kevin671/tmlt
Primary Area: Deep Learning->Attention Mechanisms
Keywords: Transformer, Looped Transformer, Expressive power, Approximation Rate, HyperNetworks
Submission Number: 10057