MUDDFormer: Breaking Residual Bottlenecks in Transformers via Multiway Dynamic Dense Connections

Published: 01 May 2025, Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY 4.0
TL;DR: We propose MUltiway Dynamic Dense (MUDD) connections, which significantly improve Transformers by enhancing cross-layer information flow. In language modeling, MUDDFormer matches the performance of Transformers trained with ~1.8x-2.4x the compute and scales well.
Abstract: We propose MUltiway Dynamic Dense (MUDD) connections, a simple yet effective method to address the limitations of residual connections and enhance cross-layer information flow in Transformers. Unlike existing dense connection approaches with static and shared connection weights, MUDD generates connection weights dynamically, depending on the hidden states at each sequence position and on each decoupled input stream (query, key, value, or residual) of a Transformer block. MUDD connections can be seamlessly integrated into any Transformer architecture to create MUDDFormer. Extensive experiments show that MUDDFormer significantly outperforms Transformers across various model architectures and scales in language modeling, achieving the performance of Transformers trained with ~1.8x-2.4x the compute. Notably, MUDDPythia-2.8B matches Pythia-6.9B in pretraining perplexity and downstream tasks and even rivals Pythia-12B in five-shot settings, while adding only 0.23% parameters and 0.4% computation. Code in JAX and PyTorch and pre-trained models are available at https://github.com/Caiyun-AI/MUDDFormer.
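To make the mechanism concrete, below is a minimal PyTorch sketch of the core idea described in the abstract: connection weights over all stored layer outputs are generated per sequence position from the current hidden state, separately for each of the four decoupled input streams (query, key, value, residual). It is an illustrative approximation, not the released implementation; the class name DynamicDenseAggregate, the single linear weight generator, and the tensor shapes are assumptions for this sketch (see the repository linked above for the authors' code).

import torch
import torch.nn as nn

class DynamicDenseAggregate(nn.Module):
    """Sketch of a MUDD-style dynamic dense connection (hypothetical names)."""

    def __init__(self, d_model: int, depth: int, num_streams: int = 4):
        super().__init__()
        self.depth = depth              # number of stored layer outputs, incl. the embedding
        self.num_streams = num_streams  # query, key, value, residual
        # Maps each position's current hidden state to per-stream, per-layer weights.
        self.to_weights = nn.Linear(d_model, num_streams * depth)

    def forward(self, hiddens: list[torch.Tensor]) -> tuple[torch.Tensor, ...]:
        # hiddens: `depth` tensors of shape [batch, seq_len, d_model]
        x = hiddens[-1]                               # current hidden state drives the weights
        stacked = torch.stack(hiddens, dim=-2)        # [B, T, depth, D]
        w = self.to_weights(x)                        # [B, T, num_streams * depth]
        w = w.view(*x.shape[:-1], self.num_streams, self.depth)
        # Per-position, per-stream weighted sum over all stored layer outputs.
        agg = torch.einsum("btsl,btld->btsd", w, stacked)  # [B, T, num_streams, D]
        return agg.unbind(dim=-2)                     # one aggregated input per stream

# Example: build decoupled inputs for a block from the embedding and two previous blocks.
agg = DynamicDenseAggregate(d_model=512, depth=3)
hs = [torch.randn(2, 16, 512) for _ in range(3)]
x_q, x_k, x_v, x_res = agg(hs)  # fed to the attention Q/K/V projections and the residual path

Because the weight generator is a small projection applied per position, the extra parameters and compute stay negligible relative to the backbone, which is consistent with the 0.23% parameter and 0.4% computation overhead reported above.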
Lay Summary: Most Transformer large language models rely on “residual connections,” fixed shortcuts that pass information from one layer to the next. These shortcuts act like a narrow service road: as traffic grows, they clog up, so builders respond by widening the entire highway—training ever-bigger, more expensive models. We introduce Multiway Dynamic Dense (MUDD) connections, which instead add flexible extra lanes only where and when traffic needs them. For every token the model processes and for every internal signal (queries, keys, values, etc.), MUDD automatically chooses the best blend of old and new information, using connection weights that adapt on the fly. Plugging MUDD connections into standard Transformer architectures—creating a model we call MUDDFormer—lets a 2.8-billion-parameter model match a 6.9-billion-parameter model, while adding just 0.4% more computation. In practice, this cuts the training cost and energy use of state-of-the-art language models by up to 60%. This breakthrough not only makes cutting-edge AI more accessible and sustainable but also offers a path to developing highly capable yet more efficient AI systems. By releasing our MUDDFormer designs and source code openly, we hope to empower more researchers and organizations to build better and greener models.
Link To Code: https://github.com/Caiyun-AI/MUDDFormer
Primary Area: Deep Learning->Foundation Models
Keywords: model architecture, Transformer, residual connections, dense skip connections, LLMs, pre-training
Flagged For Ethics Review: true
Submission Number: 6296