DeltaFormer: Unlock the state space of Transformer

Published: 18 Sept 2025 · Last Modified: 29 Oct 2025 · NeurIPS 2025 poster · CC BY 4.0
Keywords: Transformer, Circuit Complexity, Model Architecture
TL;DR: We propose a more expressive Transformer that exceeds the original architecture's $TC^0$ expressiveness.
Abstract: In recent years, large language models built on the Transformer architecture have made breakthrough progress in many fields. At the same time, these models exhibit weaknesses that have prompted reflection, the most fundamental of which concerns the Transformer architecture itself. The Transformer is highly parallel and can fully exploit the compute of GPUs, which is why it displaced models such as LSTMs over the past few years. However, high parallelism is not a free lunch: it fundamentally limits what the model can express. In particular, the problems that a logarithmic-precision Transformer can solve are strictly contained in $TC^0$, whereas many important problems are generally believed to lie outside $TC^0$, such as Python code evaluation, entity tracking, chess, and other state-tracking tasks. Meanwhile, some recent state-space methods based on the delta rule can break through the $TC^0$ limit, but they are constrained by fixed-size state spaces and perform poorly on many tasks. To this end, we re-examine the Transformer from the perspective of a state space with kernel functions and propose an improved Transformer called DeltaFormer. We demonstrate both theoretically and empirically that the proposed architecture breaks through the Transformer's inherent $TC^0$ expressivity limit, and we verify that it is no weaker than the standard Transformer on language modeling tasks. We hope our work provides inspiration for designing more expressive models.
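For readers unfamiliar with the delta rule referenced above, the display below is a minimal sketch of the matrix-valued state update used in recent delta-rule linear-attention work; the symbols $S_t$ (state), $k_t$, $v_t$, $q_t$ (key, value, query) and $\beta_t$ (write strength) follow that convention and are illustrative assumptions, not notation taken from this paper:

$$
S_t = S_{t-1}\bigl(I - \beta_t\, k_t k_t^{\top}\bigr) + \beta_t\, v_t k_t^{\top}, \qquad o_t = S_t\, q_t .
$$

Because each step overwrites the component of the state along $k_t$ rather than only adding to it, the recurrence is non-commutative across time steps, which is what lets such models track state beyond $TC^0$; viewing softmax attention as an analogous state space built from a kernel feature map is, roughly, the perspective the abstract alludes to.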
Supplementary Material: zip
Primary Area: Deep learning (e.g., architectures, generative models, optimization for deep networks, foundation models, LLMs)
Submission Number: 27463