Latent Multi-Head Attention for Small Language Models

Published: 04 Jul 2025, Last Modified: 22 Jul 2025 · KDD 2025 Workshop on Inference Optimization for GenAI (Poster) · CC BY 4.0
Keywords: transformers, efficient attention, positional encoding, language modeling, memory optimization
TL;DR: Benchmarking Transformer attention variants in small language models, comparing inference efficiency and output quality
Abstract: We present the first comprehensive study of latent multi-head attention (MLA) for small language models, characterizing its efficiency-quality trade-offs. Training ~30M-parameter Generative Pre-trained Transformer (GPT) models on 100,000 synthetic stories, we benchmark three architectural variants: standard multi-head attention (MHA), MLA, and MLA with rotary positional embeddings (MLA+RoPE). Our key finding is that MLA+RoPE with half-rank latent dimensions ($r = d/2$) achieves a 45\% Key-Value (KV) cache memory reduction while incurring only a 0.3\% increase in validation loss (essentially matching MHA quality), a Pareto improvement for memory-constrained deployment. We further show that RoPE is crucial for MLA in small models: without it, MLA underperforms vanilla attention by 3-5\%, but with RoPE it surpasses vanilla attention by 2\%. Inference benchmarks on NVIDIA A100 Graphics Processing Units (GPUs) show that MLA with $r = d/2$ achieves a 1.4× speedup over full-rank MLA while retaining the memory savings. GPT-4 evaluations corroborate the perplexity results, with MLA+RoPE achieving the highest quality scores (7.4/10) on grammar, creativity, and consistency metrics. Code and models will be released upon acceptance.
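To make the memory argument concrete, the following is a minimal PyTorch sketch of latent multi-head attention with a low-rank KV cache, assuming a down-projection/up-projection formulation in the style of multi-head latent attention; the module and parameter names (LatentMultiHeadAttention, kv_down, k_up, v_up, latent_dim, the rope hook) and the mask construction are illustrative assumptions, not the authors' released implementation.

```python
# Sketch only: caches a rank-r latent per token instead of full per-head K/V,
# so the KV cache shrinks from 2*d_model to latent_dim values per token.
import math
import torch
import torch.nn as nn


class LatentMultiHeadAttention(nn.Module):
    def __init__(self, d_model: int, n_heads: int, latent_dim: int):
        super().__init__()
        self.n_heads = n_heads
        self.head_dim = d_model // n_heads
        self.q_proj = nn.Linear(d_model, d_model, bias=False)
        # Keys and values are compressed into a shared latent of size latent_dim
        # (e.g. latent_dim = d_model // 2 for the half-rank r = d/2 variant).
        self.kv_down = nn.Linear(d_model, latent_dim, bias=False)
        self.k_up = nn.Linear(latent_dim, d_model, bias=False)
        self.v_up = nn.Linear(latent_dim, d_model, bias=False)
        self.out_proj = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x, kv_cache=None, rope=None):
        b, t, d = x.shape
        latent = self.kv_down(x)                     # (b, t, latent_dim)
        if kv_cache is not None:                     # cache latents, not full K/V
            latent = torch.cat([kv_cache, latent], dim=1)
        q = self.q_proj(x).view(b, t, self.n_heads, self.head_dim).transpose(1, 2)
        k = self.k_up(latent).view(b, -1, self.n_heads, self.head_dim).transpose(1, 2)
        v = self.v_up(latent).view(b, -1, self.n_heads, self.head_dim).transpose(1, 2)
        if rope is not None:                         # optional rotary-embedding hook
            q, k = rope(q, k)
        attn = (q @ k.transpose(-2, -1)) / math.sqrt(self.head_dim)
        # Causal mask: query i may attend to cached tokens plus positions <= i.
        mask = torch.ones(t, k.size(-2), dtype=torch.bool, device=x.device).tril(
            diagonal=k.size(-2) - t)
        attn = attn.masked_fill(~mask, float("-inf")).softmax(dim=-1)
        y = (attn @ v).transpose(1, 2).reshape(b, t, d)
        return self.out_proj(y), latent              # latents serve as the new cache
```

Under these assumptions, each decoding step stores only the latent_dim-dimensional latent per token and re-expands keys and values on the fly, which is the source of the KV-cache reduction discussed in the abstract.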
Submission Number: 7