Residual Matrix Transformers: Scaling the Size of the Residual Stream

Published: 01 May 2025 · Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY 4.0
TL;DR: We present an improved transformer variant whose residual stream is replaced with an outer product memory matrix.
Abstract: The residual stream acts as a memory bus where transformer layers both store and access features (Elhage et al., 2021). We consider changing the mechanism for storing and retrieving information in the residual stream, and replace the residual stream of the transformer with an outer product memory matrix (Kohonen, 1972; Anderson, 1972). We call this model the Residual Matrix Transformer (RMT). We find that the RMT enjoys a number of attractive properties: 1) the size of the residual stream can be scaled independently of compute and model size, improving performance, 2) the RMT can achieve the same loss as the transformer with 58% fewer FLOPs, 25% fewer parameters, and 41% fewer training tokens, and 3) the RMT outperforms the transformer on downstream evaluations. We theoretically analyze the transformer and the RMT, and show that the RMT allows for more efficient scaling of the residual stream, as well as improved variance propagation properties.
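To make the "outer product memory matrix" idea concrete, here is a minimal sketch of the classical Kohonen/Anderson-style associative memory the abstract cites: values are written as rank-1 outer-product updates to a matrix and read back by multiplying with a key. The RMT's actual write/read rules, dimensions, and integration with attention and MLP layers are defined in the paper and the linked code, not here; the function names and sizes below are illustrative assumptions only.

```python
import torch

# Sketch of an outer-product (Kohonen 1972 / Anderson 1972) memory matrix.
# A layer "writes" a value vector v under a key vector k by adding the outer
# product v k^T to the matrix M; a later layer "reads" by multiplying M with
# a query key. d_key and d_value are placeholder sizes, not RMT hyperparameters.

d_key, d_value = 64, 256

def write(M: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    """Store v under key k via a rank-1 outer-product update."""
    return M + torch.outer(v, k)   # shape (d_value, d_key)

def read(M: torch.Tensor, k: torch.Tensor) -> torch.Tensor:
    """Retrieve the value associated with key k (exact if keys are orthonormal)."""
    return M @ k                   # shape (d_value,)

# Usage: store one feature and retrieve it with the same key.
M = torch.zeros(d_value, d_key)
k = torch.nn.functional.normalize(torch.randn(d_key), dim=0)
v = torch.randn(d_value)
M = write(M, k, v)
v_hat = read(M, k)                 # approximately v, up to interference from other writes
```

Note that the memory holds d_value × d_key entries, so its capacity can be grown by widening either axis without changing the per-layer vector width, which is the intuition behind scaling the residual stream independently of compute and model size.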
Lay Summary: Traditional AI language models are based on the transformer architecture. The transformer uses a "residual stream" - essentially a memory highway where different parts of the model store and retrieve information. However, this memory system is limited in how much information it can store, and scaling it up is expensive, requiring more computational power and parameters. We replaced this traditional memory highway with a new "outer product memory matrix" system, creating what we call the Residual Matrix Transformer (RMT). This new approach allows the size of the residual stream to scale independently from the overall model size and computation requirements. Our RMT achieves the same performance as traditional transformers while using 58% fewer computational operations, 25% fewer parameters, and 41% less training data. This could make powerful AI language models more efficient and accessible, reducing the computational costs and energy requirements for training and running advanced AI systems.
Link To Code: https://github.com/bmac3/residual-matrix-transformer
Primary Area: Deep Learning->Large Language Models
Keywords: transformer models, deep learning, neural network architectures, efficient language modeling
Submission Number: 9560