Keywords: transformers, efficient attention, positional encoding, language modeling, memory optimization
TL;DR: We benchmark Transformer attention variants (MHA, MLA, and MLA with RoPE) in small language models, comparing inference efficiency and generation quality
Abstract: We present the first comprehensive study of multi-head latent attention (MLA)
for small language models, revealing its efficiency-quality trade-offs at this scale.
Training $\sim$30M-parameter Generative Pre-trained Transformer (GPT) models on 100,000 synthetic stories,
we benchmark three architectural variants: standard multi-head attention (MHA),
MLA, and MLA with rotary positional embeddings (RoPE), which we denote \ours. Our key finding is that
\ours with half-rank latent dimensions ($r=d/2$) reduces key-value (KV) cache memory
by 45\% while incurring only a 0.3\% increase in validation loss (essentially matching MHA's quality), a Pareto improvement
for memory-constrained deployment. We further show that RoPE is crucial for MLA
in small models: without it, MLA underperforms MHA by 3--5\%, but
with RoPE it surpasses MHA by 2\%. Inference benchmarks on NVIDIA A100 GPUs
reveal that MLA with $r=d/2$ achieves a 1.4$\times$ speedup over full-rank MLA while
maintaining its memory savings. GPT-4 evaluations corroborate the perplexity
results, with \ours achieving the highest quality scores (7.4/10) on
grammar, creativity, and consistency metrics. The code and models will be released
upon acceptance.
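To make the KV-cache argument concrete, the sketch below shows one way an MLA-style layer can cache a shared low-rank latent of size $r$ per token instead of the full keys and values cached by standard MHA. This is a minimal illustration under our own assumptions: the class name, dimensions, and PyTorch framing are not from the paper, and the RoPE handling used in \ours is omitted for brevity.

```python
# Illustrative sketch of MLA-style low-rank KV caching (assumed design, not the
# paper's released implementation). Keys and values are reconstructed from a
# shared latent of size r, so the cache stores r numbers per token per layer
# instead of 2 * d_model for standard MHA.
import torch
import torch.nn as nn
import torch.nn.functional as F


class LatentKVAttention(nn.Module):
    def __init__(self, d_model: int, n_heads: int, r: int):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.q_proj = nn.Linear(d_model, d_model, bias=False)
        self.kv_down = nn.Linear(d_model, r, bias=False)   # compress tokens to the latent
        self.k_up = nn.Linear(r, d_model, bias=False)       # expand latent -> keys
        self.v_up = nn.Linear(r, d_model, bias=False)       # expand latent -> values
        self.out_proj = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x, latent_cache=None):
        # x: (batch, seq, d_model); latent_cache: (batch, past_len, r) or None.
        b, t, _ = x.shape
        latent = self.kv_down(x)                             # (b, t, r)
        if latent_cache is not None:
            # Only the low-rank latent is cached across decoding steps.
            latent = torch.cat([latent_cache, latent], dim=1)
        q = self.q_proj(x).view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        k = self.k_up(latent).view(b, -1, self.n_heads, self.d_head).transpose(1, 2)
        v = self.v_up(latent).view(b, -1, self.n_heads, self.d_head).transpose(1, 2)
        # Causal masking during prefill; incremental decode (t == 1) needs no mask.
        y = F.scaled_dot_product_attention(q, k, v, is_causal=latent_cache is None)
        y = y.transpose(1, 2).reshape(b, t, -1)
        return self.out_proj(y), latent                      # latent doubles as the new cache
```

In this sketch the per-token cache grows with $r$ rather than with $2d$; the exact 45\% figure reported above reflects the paper's specific configuration, which is not reproduced here.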
Submission Number: 7