Expert Attention: MoE-Based Head Decoupling and Pruning for Pretrained Encoders

ACL ARR 2025 May Submission 5264 Authors

20 May 2025 (modified: 03 Jul 2025) · ACL ARR 2025 May Submission · CC BY 4.0
Abstract: Encoder-only models benefit from bidirectional attention, which enables high parallelism and strong throughput and makes them well suited to large-scale supervised tasks. However, their inference efficiency remains a bottleneck in real-world deployment. We propose Expert Attention, a Mixture-of-Experts (MoE)-based method that treats each attention head as an independent expert. A gating mechanism dynamically selects which heads to activate, guided by a two-stage training strategy of load balancing followed by specialization. After training, a Top-1 selection strategy prunes unused heads, significantly improving throughput. Unlike prior pruning methods, our approach is purely architectural and requires no complex scoring functions, making it simple and practical. Experiments show that Expert Attention achieves substantial speedups with minimal performance loss, outperforming existing attention head pruning techniques.
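To make the abstract's core idea concrete, the following is a minimal, hypothetical sketch (not the authors' implementation) of gated attention heads with Top-1 routing: each head acts as an expert, a small gate scores heads per input, and heads the gate never selects could be pruned after training. All names (`ExpertAttentionSketch`, the pooled gating input, the soft/Top-1 switch) are illustrative assumptions; the paper's load-balancing and specialization losses are omitted.

```python
# Hypothetical sketch of MoE-style gating over attention heads; not the
# authors' code. Assumes PyTorch >= 2.0 for scaled_dot_product_attention.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ExpertAttentionSketch(nn.Module):
    """Bidirectional multi-head self-attention with a per-head gate (illustrative)."""

    def __init__(self, d_model: int = 768, n_heads: int = 12):
        super().__init__()
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.out = nn.Linear(d_model, d_model)
        # Gate maps a pooled sequence representation to one score per head.
        self.gate = nn.Linear(d_model, n_heads)

    def forward(self, x: torch.Tensor, top1: bool = False) -> torch.Tensor:
        b, t, d = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # Reshape to (batch, heads, tokens, d_head).
        shape = (b, t, self.n_heads, self.d_head)
        q, k, v = (z.view(shape).transpose(1, 2) for z in (q, k, v))
        attn = F.scaled_dot_product_attention(q, k, v)  # bidirectional, no causal mask

        gate_logits = self.gate(x.mean(dim=1))  # (batch, heads)
        if top1:
            # Inference-style Top-1 routing: keep only the highest-scoring head;
            # heads that are never selected are candidates for pruning.
            weights = F.one_hot(gate_logits.argmax(-1), self.n_heads).float()
        else:
            # Training-style soft routing (auxiliary losses omitted here).
            weights = torch.softmax(gate_logits, dim=-1)

        attn = attn * weights[:, :, None, None]  # scale each head's output
        return self.out(attn.transpose(1, 2).reshape(b, t, d))


if __name__ == "__main__":
    layer = ExpertAttentionSketch()
    y = layer(torch.randn(2, 16, 768), top1=True)
    print(y.shape)  # torch.Size([2, 16, 768])
```

Under these assumptions, pruning amounts to physically removing the rows of the QKV and output projections that belong to never-selected heads, which is what yields the throughput gains described in the abstract.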
Paper Type: Long
Research Area: Efficient/Low-Resource Methods for NLP
Research Area Keywords: pruning; parameter-efficient-training
Contribution Types: Approaches for low compute settings-efficiency
Languages Studied: English
Submission Number: 5264