Head-in-Head in Linear Attention

Shijie Mei; Man Yao; Jiabo Tong; Bo XU; Guoqi Li

Published: 30 Apr 2026, Last Modified: 24 Jun 2026. Venue: ICML 2026 regular. License: CC BY 4.0

TL;DR: Head‑in‑Head is a method for extending the low‑rank approximation of state‑transition matrices that contain non‑diagonal elements. It achieves structured, selective enhancement by regrouping memory units within each head.

Abstract: The state-transition (decay) matrix governs how fixed-size memory is updated and used, making it a core design choice in linear attention models. Prior work exploits rank-1 approximations to reduce the cost of constructing decay matrices, but this low-rank constraint also limits expressive capacity. We therefore formulate decay-matrix design as an open optimization problem: maximizing expressiveness while introducing minimal additional cost. Inspired by the multi-head mechanism, we propose Head-in-Head, which introduces an additional mask matrix to structure memory partitioning and interactions within a single linear-attention head. This simple, generic, and efficient design 1) enables a rank-$r$ approximation of the decay matrix with only a few extra parameters and 2) strengthens intra-head information interaction. We further develop mask normalization and a chunk-wise parallelization scheme to support efficient parallel training. Extensive experiments on synthetic benchmarks and language modeling tasks, together with visual analyses, show that Head-in-Head consistently improves over its baselines by enriching information diversity and strengthening intra-head interactions. Code is available at: https://github.com/msj-19/Head-in-Head-Linear-Attention

Lay Summary: The core component of linear attention models is the state-transition matrix, which mixes information across time steps by acting on the state from the previous step. In this work, we investigate how far existing low-rank constructions of this matrix deviate from a fully dense one, and explore how to bridge the two. To this end, we introduce an additional mask matrix that applies block-wise weighting to the original low-rank part, effectively increasing its rank while enabling selective control over the model's partitioned memory states. Our approach raises the rank of cross-row interactions in the state-transition matrix with minimal parameter overhead and performs strongly across a range of tasks. It points to new possibilities for designing state-transition matrices in linear attention models. We have also developed the corresponding operator implementations to facilitate further research in this direction.
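
The rank argument in the summary above can be illustrated with a small NumPy sketch. It is not the authors' implementation: the delta-rule-style form `I - beta * k k^T` for the rank-1 transition, the function names `block_mask` and `masked_transition`, and the mask values are all assumptions chosen for illustration. The sketch only shows that weighting a rank-1 matrix block-wise with an `r x r` mask can raise the rank of the update from 1 to as much as `r`.

```python
import numpy as np

def block_mask(head_mask: np.ndarray, block: int) -> np.ndarray:
    """Expand an (r, r) mask to (r*block, r*block); each entry fills one block."""
    return np.kron(head_mask, np.ones((block, block)))

def masked_transition(k: np.ndarray, beta: float, head_mask: np.ndarray) -> np.ndarray:
    """Assumed form: I - beta * (expanded mask, elementwise-times, k k^T)."""
    d = k.shape[0]
    r = head_mask.shape[0]
    assert d % r == 0, "state dimension must split evenly into r groups"
    low_rank = np.outer(k, k)                       # rank-1 part
    weighted = block_mask(head_mask, d // r) * low_rank  # block-wise weighting
    return np.eye(d) - beta * weighted

d, r, beta = 8, 4, 0.5
k = np.arange(1.0, d + 1.0)
k /= np.linalg.norm(k)                              # deterministic unit-norm key

all_ones = np.ones((r, r))                          # no regrouping: plain rank-1 update
mixing = np.eye(r) + 0.5 * np.eye(r, k=1)           # hypothetical full-rank mask

def update_rank(transition: np.ndarray) -> int:
    """Rank of the non-identity part of the transition matrix."""
    return int(np.linalg.matrix_rank(np.eye(d) - transition))

rank_plain = update_rank(masked_transition(k, beta, all_ones))
rank_masked = update_rank(masked_transition(k, beta, mixing))
print(rank_plain, rank_masked)
```

With an all-ones mask the construction reduces to the unmasked rank-1 case, while a full-rank `r x r` mask adds only `r * r` values yet lifts the rank of the update to `r`, which matches the "few extra parameters" framing in the abstract.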

Link To Code: https://github.com/msj-19/Head-in-Head-Linear-Attention

Primary Area: Deep Learning->Foundation Models

Keywords: Linear attention, Large Language Models

Submission Number: 21808