MoH: Multi-Head Attention as Mixture-of-Head Attention

Published: 01 May 2025, Last Modified: 18 Jun 2025, ICML 2025 poster, CC BY 4.0
TL;DR: We propose Mixture-of-Head attention (MoH), which outperforms multi-head attention even when using only 50%~90% of the attention heads.
Abstract: In this work, we upgrade the multi-head attention mechanism, the core of the Transformer model, to reduce computational cost while maintaining or surpassing the previous level of accuracy. We show that multi-head attention can be expressed in summation form. Drawing on the insight that not all attention heads hold equal significance, we propose Mixture-of-Head attention (MoH), a new architecture that treats attention heads as experts in the Mixture-of-Experts (MoE) mechanism. MoH has two significant advantages: First, MoH enables each token to select the appropriate attention heads, enhancing inference efficiency without compromising accuracy or increasing the number of parameters. Second, MoH replaces the standard summation in multi-head attention with a weighted summation, introducing flexibility into the attention mechanism and unlocking extra performance potential. Extensive experiments on ViT, DiT, and LLMs demonstrate that MoH outperforms multi-head attention while using only 50%$\sim$90% of the attention heads. Moreover, we demonstrate that pre-trained multi-head attention models, such as LLaMA3-8B, can be further continue-tuned into our MoH models. Notably, MoH-LLaMA3-8B achieves an average accuracy of 64.0% across 14 benchmarks, outperforming LLaMA3-8B by 2.4% while utilizing only 75% of the attention heads. We believe the proposed MoH is a promising alternative to multi-head attention and provides a strong foundation for developing advanced and efficient attention-based models.
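The summation view referenced in the abstract is the key idea: the output of standard multi-head attention can be written as $\sum_{h=1}^{H} \mathrm{head}_h W_o^h$, and MoH replaces it with a weighted sum $\sum_{h=1}^{H} g_h\,\mathrm{head}_h W_o^h$, where the routing weight $g_h$ is non-zero only for the heads selected for a given token. The snippet below is a minimal PyTorch sketch of this idea, assuming a plain per-token top-k router; the class and argument names (MoHAttention, num_active_heads) are illustrative, and the official implementation at the code link below differs in detail (e.g., it uses shared heads that are always active and an auxiliary load-balance term).

```python
# Minimal sketch of Mixture-of-Head (MoH) attention with a simple top-k
# router over heads. Illustrative only; see https://github.com/SkyworkAI/MoH
# for the official implementation.
import torch
import torch.nn as nn


class MoHAttention(nn.Module):
    def __init__(self, dim, num_heads=8, num_active_heads=6):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.num_active_heads = num_active_heads
        self.head_dim = dim // num_heads
        self.qkv = nn.Linear(dim, 3 * dim)
        self.proj = nn.Linear(dim, dim)
        # Router produces one score per head for every token.
        self.router = nn.Linear(dim, num_heads)

    def forward(self, x):
        B, N, C = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)            # each: (B, H, N, d)

        # Standard scaled dot-product attention, computed per head.
        attn = (q @ k.transpose(-2, -1)) / self.head_dim ** 0.5
        attn = attn.softmax(dim=-1)
        heads = attn @ v                                 # (B, H, N, d)

        # Route: keep the top-k heads per token and renormalize their
        # scores, so head outputs are combined by a weighted sum instead
        # of the plain sum used in vanilla multi-head attention.
        scores = self.router(x)                          # (B, N, H)
        topk_scores, topk_idx = scores.topk(self.num_active_heads, dim=-1)
        weights = torch.zeros_like(scores).scatter(
            -1, topk_idx, topk_scores.softmax(dim=-1))   # (B, N, H)

        heads = heads.permute(0, 2, 1, 3)                # (B, N, H, d)
        out = (heads * weights.unsqueeze(-1)).reshape(B, N, C)
        return self.proj(out)
```

Note that this sketch still computes every head and merely zeroes out the unselected ones; skipping the computation of inactive heads entirely is what yields the inference savings described above.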
Lay Summary: Transformers are the backbone of many modern AI models, helping computers understand language, recognize images, and even generate art. A key part of how transformers work is called “multi-head attention,” which allows the model to focus on different parts of the input simultaneously. However, not all of these attention “heads” are equally useful—some do more work than others. In our research, we introduce a smarter version of this system called Mixture-of-Head Attention (MoH). Instead of using all attention heads all the time, MoH lets the model choose only the most helpful ones for each piece of input, like picking the best team members for a job. This makes the model faster and more efficient without hurting its performance—and often makes it even better. We tested MoH on a range of tasks, including understanding images, generating pictures, and answering questions. In every case, MoH matched or outperformed traditional methods, even when using fewer attention heads. We also showed that existing models, like LLaMA3, can be upgraded to use MoH, making it easy to apply our method to today’s top AI systems. MoH is a promising step toward making powerful AI models faster, cheaper, and more adaptable.
Link To Code: https://github.com/SkyworkAI/MoH
Primary Area: Deep Learning->Attention Mechanisms
Keywords: Multi-Head Attention, Mixture of Experts, Foundation Models
Submission Number: 6339