Knocking-Heads Attention

ACL ARR 2026 January Submission 10890 Authors

06 Jan 2026 (modified: 20 Mar 2026) · ACL ARR 2026 January Submission · CC BY 4.0
Keywords: Attention head interaction, training stability
Abstract: Talking-heads attention mitigates head isolation in existing multi-head attention mechanisms by linearly mixing attention maps across heads. While this improves representational capacity, it is incompatible with existing attention kernels and incurs significant computational overhead. To address this limitation, we propose knocking-heads attention (KHA), which enables attention heads to "knock" on each other, facilitating cross-head feature-level interactions before the scaled dot-product attention. This is achieved by applying a shared, diagonally initialized projection matrix across all heads. The diagonal initialization preserves head-specific specialization at the start of training while allowing the model to progressively learn integrated cross-head representations. KHA adds only minimal parameters and FLOPs and can be seamlessly integrated into MHA, GQA, GTA, and other attention variants. We validate KHA by training a 6.1B-parameter MoE model (1.01B activated) on 1T high-quality tokens. Compared to standard attention, KHA yields superior and more stable training dynamics, achieving better performance across downstream tasks.
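The abstract describes the mechanism only at a high level; one plausible reading is that the shared, diagonally initialized projection is an H×H matrix that mixes features across heads before scaled dot-product attention, so that identity initialization reproduces standard per-head attention exactly. The following numpy sketch illustrates that reading; the function names, the choice to mix Q, K, and V alike, and the single-matrix form are assumptions, not the paper's verified implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def knocking_heads_attention(Q, K, V, W_knock):
    """Hypothetical KHA sketch.

    Q, K, V: per-head tensors of shape (H, T, d).
    W_knock: shared (H, H) head-mixing matrix. Initialized to the identity
    (diagonal), so each head initially attends only to its own features;
    off-diagonal entries learned during training introduce cross-head mixing.
    """
    # "Knock": feature-level interaction across heads, before attention.
    Qm = np.einsum('hg,gtd->htd', W_knock, Q)
    Km = np.einsum('hg,gtd->htd', W_knock, K)
    Vm = np.einsum('hg,gtd->htd', W_knock, V)
    # Standard scaled dot-product attention on the mixed heads.
    d = Q.shape[-1]
    scores = np.einsum('htd,hsd->hts', Qm, Km) / np.sqrt(d)
    return np.einsum('hts,hsd->htd', softmax(scores), Vm)
```

With `W_knock = np.eye(H)` the output matches vanilla multi-head attention, which is consistent with the claim that diagonal initialization preserves head-specific specialization at the start of training while adding only an H×H parameter matrix.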
Paper Type: Long
Research Area: Language Models
Research Area Keywords: Generation, Language Modeling
Contribution Types: Model analysis & interpretability, NLP engineering experiment
Languages Studied: English
Submission Number: 10890