Keywords: Matryoshka models, multi-head latent attention, adaptive inference, KV cache, efficient language models
TL;DR: MatMLA combines Matryoshka-style nested attention heads with MLA’s fixed latent KV cache, enabling multiple compute budgets within one model while avoiding the cache-layout issues of nested MHA.
Abstract: Language models are increasingly deployed under diverse inference constraints, from low-latency interactive serving to higher-quality generation with larger compute budgets. Matryoshka-style Transformers address this setting by exposing multiple compute-quality operating points within a single set of weights, avoiding the need to train, store, and serve separate models for each budget.
In parallel, multi-head latent attention (MLA) improves autoregressive decoding by replacing full per-head Key-Value (KV) caches with a compact latent cache. These ideas target complementary aspects of inference efficiency: Matryoshka controls the compute budget, while MLA controls the cache footprint. We introduce Matryoshka Multi-Head Latent Attention (MatMLA), a Matryoshka-style extension of MLA that exposes multiple attention-head budgets within a single latent-attention module. MatMLA inherits the compact KV cache of MLA while adding the nested sub-model structure of Matryoshka-style attention. This also resolves a key limitation of nested Multi-Head Attention (MHA): varying the active head count during inference requires irregular KV-cache layouts or full caches with inactive heads, whereas MatMLA's fixed-size latent cache gives all sub-models a shared memory format. We show that a 210M-parameter MatMLA model trained on 4.2B FineWeb-Edu tokens with nested head budgets of 12, 8, and 4 heads achieves perplexities close to the corresponding fixed-head MLA baselines, while preserving MLA's compact KV cache and enabling a shared nested parameterization across compute budgets.
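The cache-layout argument in the abstract can be illustrated with a back-of-the-envelope count of cached values per token per layer. The sketch below is not the authors' code: only the head budgets of 12, 8, and 4 come from the abstract, while `d_head`, `d_latent`, `d_rope`, and all function and class names are hypothetical values chosen for illustration. It assumes standard MHA caches one key and one value vector per head, and that an MLA-style cache stores a single latent vector (optionally with a shared positional key) regardless of how many heads read from it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AttnConfig:
    n_heads_full: int = 12  # largest head budget (from the abstract)
    d_head: int = 64        # assumed per-head dimension (hypothetical)
    d_latent: int = 128     # assumed latent cache width (hypothetical)
    d_rope: int = 0         # assumed width of a shared positional key, if any


def mha_cache_per_token(cfg: AttnConfig, n_active: int, pad_to_full: bool = False) -> int:
    """Cached values per token per layer for nested MHA.

    Either only the active heads are cached (size varies with the budget),
    or the full head set is cached and inactive heads are left unused.
    """
    heads = cfg.n_heads_full if pad_to_full else n_active
    return 2 * heads * cfg.d_head  # one key and one value vector per head


def mla_cache_per_token(cfg: AttnConfig, n_active: int) -> int:
    """Cached values per token per layer for a latent cache.

    n_active is accepted only to make the point that it is ignored:
    every head budget reads the same latent vector.
    """
    return cfg.d_latent + cfg.d_rope


cfg = AttnConfig()
for k in (12, 8, 4):
    print(
        f"heads={k:2d}  "
        f"MHA(active only)={mha_cache_per_token(cfg, k):5d}  "
        f"MHA(padded)={mha_cache_per_token(cfg, k, pad_to_full=True):5d}  "
        f"latent={mla_cache_per_token(cfg, k):4d}"
    )
```

Under these assumed dimensions, the nested-MHA cache either changes size with the head budget or stays at the full size with unused slots, while the latent cache has the same shape for every budget. The actual ratios depend on the model's real dimensions, which the abstract does not state.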
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 189