Keywords: Multi-head Latent Attention, Covariance- & Rank-aware, Singular Value Decomposition
TL;DR: CARE converts existing attention modules to MLA under the same KV-cache budget: covariance-aware SVD minimizes output error, and rank allocation maximizes retained energy. On LLaMA, it yields up to 331% relative improvement in zero-shot evaluation; a brief healing fine-tune restores the original accuracy.
Abstract: Converting pretrained attention modules such as *grouped-query attention* (GQA) into *multi-head latent attention* (MLA) can improve expressivity without increasing KV-cache cost, making it attractive for efficient inference. However, existing conversion methods typically apply naïve singular value decomposition (SVD): they minimize the difference between weight matrices rather than the effect of those weights on input activations, ignore the covariance structure of the activations, and enforce a uniform rank across layers, causing activation drift and degraded attention fidelity. To address these issues, we propose **CARE** (**C**ovariance-**A**ware, **R**ank-**E**nhanced), an MLA conversion pipeline under a fixed KV width. CARE introduces three key steps: (i) **activation-preserving factorization**, which aligns the approximation with the actual input activations rather than just the weights; (ii) **adjusted-rank allocation**, which distributes a fixed KV budget across layers, giving more capacity to the layers that need it most; and (iii) **KV-parity mapping**, which reparameterizes the converted \(K\) and \(V\) to fit the MLA format while keeping the KV-cache size unchanged. Under a matched KV-cache budget, our method consistently outperforms a uniform-rank SVD baseline on Llama-3-8B, delivering up to **331%** relative gains in one-shot evaluation (higher accuracy, lower perplexity). With a brief post-SVD “healing” fine-tune, we fully recover the original model’s accuracy.
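The two core ideas, covariance-aware factorization and budgeted rank allocation, can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the paper's implementation: the function names, the symmetric square-root whitening, and the greedy energy-based allocation scheme are all assumptions made for illustration. The key point of step (i) is that truncating the SVD of \(W C^{1/2}\) (where \(C\) is the input covariance) minimizes the expected output error \(\mathbb{E}\|(W-\hat W)x\|^2\), rather than the weight error \(\|W-\hat W\|_F^2\) that a plain SVD of \(W\) minimizes.

```python
import numpy as np

def covariance_aware_factor(W, C, r):
    """Rank-r factors (A, B) with A @ B ~ W, chosen to minimize
    E||(W - A @ B) x||^2 for inputs x with covariance C.
    (Illustrative sketch of an activation-preserving SVD.)"""
    # Symmetric square root (and inverse root) of the input covariance.
    evals, evecs = np.linalg.eigh(C)
    evals = np.clip(evals, 1e-8, None)  # numerical floor for stability
    C_half = evecs @ np.diag(np.sqrt(evals)) @ evecs.T
    C_half_inv = evecs @ np.diag(1.0 / np.sqrt(evals)) @ evecs.T
    # SVD in the whitened space, truncate, then map back.
    U, S, Vt = np.linalg.svd(W @ C_half, full_matrices=False)
    A = U[:, :r] * S[:r]          # (d_out, r)
    B = Vt[:r] @ C_half_inv       # (r, d_in)
    return A, B

def allocate_ranks(energies_per_layer, total_budget):
    """Greedy rank allocation: repeatedly grant one unit of rank to the
    layer whose next singular-value energy is largest, so the fixed KV
    budget flows to the layers that need it most. (Hypothetical scheme.)"""
    n = len(energies_per_layer)
    ranks = [0] * n
    for _ in range(total_budget):
        best = max(
            range(n),
            key=lambda i: energies_per_layer[i][ranks[i]]
            if ranks[i] < len(energies_per_layer[i]) else -np.inf,
        )
        ranks[best] += 1
    return ranks
```

With the identity covariance, `covariance_aware_factor` reduces to a plain truncated SVD of `W`; a non-trivial `C` reweights directions by how strongly the activations actually excite them.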
Supplementary Material: pdf
Primary Area: generative models
Submission Number: 2678