Knocking-Heads Attention

ACL ARR 2026 January Submission 10890 Authors

06 Jan 2026 (modified: 20 Mar 2026) · ACL ARR 2026 January Submission · CC BY 4.0
Keywords: Attention head interaction, training stability
Abstract: Talking-heads attention mitigates head isolation in existing multi-head attention mechanisms by linearly mixing attention maps across heads. While this improves representational capacity, it is incompatible with existing attention kernels and incurs significant computational overhead. To address this limitation, we propose knocking-heads attention (KHA), which enables attention heads to "knock" on each other, facilitating cross-head feature-level interactions before the scaled dot-product attention. This is achieved by applying a shared, diagonally initialized projection matrix across all heads. The diagonal initialization preserves head-specific specialization at the start of training while allowing the model to progressively learn integrated cross-head representations. KHA adds only minimal parameters and FLOPs and can be seamlessly integrated into MHA, GQA, GTA, and other attention variants. We validate KHA by training a 6.1B-parameter MoE model (1.01B activated) on 1T high-quality tokens. Compared to standard attention, KHA yields superior and more stable training dynamics, achieving better performance across downstream tasks.
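The abstract describes the mechanism only at a high level; one plausible reading is that the shared, diagonally initialized projection is an H×H matrix that mixes features across heads before scaled dot-product attention, so that identity initialization reproduces standard per-head attention exactly. The following numpy sketch illustrates that reading; the function names, the choice to mix Q, K, and V alike, and the single-matrix form are assumptions, not the paper's verified implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def knocking_heads_attention(Q, K, V, W_knock):
    """Hypothetical KHA sketch.

    Q, K, V: per-head tensors of shape (H, T, d).
    W_knock: shared (H, H) head-mixing matrix. Initialized to the identity
    (diagonal), so each head initially attends only to its own features;
    off-diagonal entries learned during training introduce cross-head mixing.
    """
    # "Knock": feature-level interaction across heads, before attention.
    Qm = np.einsum('hg,gtd->htd', W_knock, Q)
    Km = np.einsum('hg,gtd->htd', W_knock, K)
    Vm = np.einsum('hg,gtd->htd', W_knock, V)
    # Standard scaled dot-product attention on the mixed heads.
    d = Q.shape[-1]
    scores = np.einsum('htd,hsd->hts', Qm, Km) / np.sqrt(d)
    return np.einsum('hts,hsd->htd', softmax(scores), Vm)
```

With `W_knock = np.eye(H)` the output matches vanilla multi-head attention, which is consistent with the claim that diagonal initialization preserves head-specific specialization at the start of training while adding only an H×H parameter matrix.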
Paper Type: Long
Research Area: Language Models
Research Area Keywords: Generation, Language Modeling
Contribution Types: Model analysis & interpretability, NLP engineering experiment
Languages Studied: English
Submission Number: 10890