CARE: Covariance-Aware and Rank-Enhanced Decomposition for Enabling Multi-Head Latent Attention

Zhongzhu Zhou; Fengxiang Bie; Ziyan Chen; Zhenyu Zhang; Yibo Yang; Junxiong Wang; Ben Athiwaratkun; Xiaoxia Wu; Shuaiwen Leon Song

Published: 26 Jan 2026, Last Modified: 30 Apr 2026

Venue: ICLR 2026 Poster

License: CC BY 4.0

Keywords: Multi Latent Attention, Covariance & Rank aware, Singular value decomposition

TL;DR: CARE converts pretrained GQA/MHA to MLA at KV-parity via covariance-aware SVD and adjusted-rank allocation, reducing perplexity by up to 215× and improving accuracy by up to 21 points over baselines on Llama-3.1-8B/70B and Qwen3-4B/30B-A3B.

Abstract: Converting pretrained attention modules such as *grouped-query attention* (GQA) into *multi-head latent attention* (MLA) can improve expressivity without increasing KV-cache cost, making it attractive for efficient inference. However, many practical conversion baselines rely on weight-only low-rank approximations (e.g., SVD-style initializations) and uniform rank allocation. They focus on minimizing the difference between weight matrices rather than on how those weights affect input activations, ignore the covariance structure of activations, and enforce uniform rank across layers—causing activation drift and degraded attention fidelity. To address these issues, we propose CARE, a ***C**ovariance-**A**ware, **R**ank-**E**nhanced* MLA conversion pipeline under a fixed KV width. CARE introduces three key steps: (i) *activation-preserving factorization*, which aligns the approximation with the actual input activations rather than just the weights; (ii) *adjusted-rank allocation*, which spreads a fixed KV budget across layers by giving more capacity to layers that need it most; and (iii) *KV-parity mapping*, which reparameterizes the converted K and V to fit the MLA format while keeping the KV-cache size unchanged. Our method outperforms a uniform-rank SVD baseline on Qwen3-4B/30B-A3B-Instruct-2507 and Llama-3.1-8B/70B-Instruct, reducing one-shot perplexity by up to 215× and improving mean accuracy by up to 1.70× at matched KV budgets. With a brief post-SVD "healing" fine-tune, we fully recover the original model's accuracy.
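To make step (ii) concrete, here is a minimal sketch of non-uniform rank allocation under a fixed total budget. The abstract does not state CARE's actual allocation rule, so everything here is an assumption made for illustration: the function name `allocate_ranks`, the `min_rank` floor, and the greedy criterion (give each remaining unit of rank to the layer whose next unused singular value has the largest squared magnitude). In a covariance-aware setting the spectra would presumably come from activation-weighted matrices rather than raw weights; the toy spectra below are made up.

```python
import heapq


def allocate_ranks(spectra, total_rank, min_rank=1):
    """Greedily split `total_rank` across layers (illustrative only).

    spectra: list of per-layer singular values, each sorted descending.
    Every layer starts at `min_rank`; each remaining unit of rank goes to
    the layer whose next unused singular value has the largest squared
    magnitude, an assumed proxy for "needs capacity most".
    """
    if total_rank < len(spectra) * min_rank:
        raise ValueError("budget too small for the minimum rank per layer")
    ranks = [min(min_rank, len(s)) for s in spectra]
    # heapq is a min-heap, so gains are negated to pop the largest first.
    heap = [
        (-s[r] ** 2, i)
        for i, (s, r) in enumerate(zip(spectra, ranks))
        if r < len(s)
    ]
    heapq.heapify(heap)
    remaining = total_rank - sum(ranks)
    while remaining > 0 and heap:
        _, i = heapq.heappop(heap)
        ranks[i] += 1
        remaining -= 1
        if ranks[i] < len(spectra[i]):
            heapq.heappush(heap, (-spectra[i][ranks[i]] ** 2, i))
    return ranks


# Hypothetical spectra for three layers; a uniform split would give [2, 2, 2].
toy_spectra = [
    [10.0, 1.0, 0.5, 0.1],   # one dominant direction
    [5.0, 4.0, 3.0, 2.0],    # slowly decaying spectrum
    [2.0, 0.2, 0.1, 0.05],   # low energy overall
]
print(allocate_ranks(toy_spectra, total_rank=6))  # [1, 4, 1]
```

With the same total of 6, the slowly decaying layer receives most of the budget while the two layers with a single dominant direction keep only the floor, which is the qualitative behavior the abstract describes. The total stays fixed, so the overall KV budget is unchanged.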

Primary Area: generative models

Submission Number: 2678
