Keywords: preference optimization, dimensionality reduction, kv cache optimization
TL;DR: We introduce DimPO, a novel listwise preference optimization loss for dimensionality reduction of attention that reduces KV cache memory in LLMs by 10-15% with minimal performance loss.
Abstract: Large Language Models (LLMs) require substantial memory and computation time, particularly for long-context tasks. To handle long sequences, LLMs use KV caches, whose memory size grows linearly with the number of tokens. In this work, we focus on reducing KV cache memory by projecting key and query vectors into learned lower-dimensional spaces. We pose this problem, previously addressed with a triplet loss for Locality Sensitive Hashing (LSH), as a preference optimization problem. We show that the preference optimization approach performs better mainly at higher dimensions, indicating its potential for training attention in reduced dimensions. To address this, we introduce DimPO, a novel reference-model-free, listwise preference optimization loss. We demonstrate that DimPO preserves attention distributions in reduced dimensions more accurately than both existing preference optimization losses and the triplet loss. Building on this, we apply DimPO-based dimensionality reduction to the attention layers of LLaMA3-[1B, 3B, 8B], Qwen2.5-7B, and Qwen3-4B instruct models. On general benchmark tasks, DimPO Attentions reduce KV cache memory by 10-15% while maintaining 95% of performance. Larger models using DimPO Attentions on long-context tasks also exhibit only a marginal performance drop.
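To make the setup concrete, here is a minimal sketch of the general idea described in the abstract: learned low-dimensional projections of keys and queries trained with a listwise objective so that the reduced-dimension attention distribution matches the full-dimension one. This is an illustrative stand-in under assumed names (proj_q, proj_k, reduced_dim) and an assumed KL-based objective; it is not the paper's DimPO loss.

```python
import torch
import torch.nn.functional as F

d_model, reduced_dim = 128, 96          # e.g., caching keys in a smaller dimension
proj_q = torch.nn.Linear(d_model, reduced_dim, bias=False)  # hypothetical learned query projection
proj_k = torch.nn.Linear(d_model, reduced_dim, bias=False)  # hypothetical learned key projection

def listwise_attention_matching_loss(q, k):
    """KL between full-dim and reduced-dim attention rows (a listwise-style objective)."""
    # Full-dimension attention distribution over keys acts as the target ranking.
    full = F.softmax(q @ k.transpose(-1, -2) / d_model ** 0.5, dim=-1)
    # Attention logits computed from the learned low-dimensional projections.
    red_logits = proj_q(q) @ proj_k(k).transpose(-1, -2) / reduced_dim ** 0.5
    return F.kl_div(F.log_softmax(red_logits, dim=-1), full, reduction="batchmean")

q = torch.randn(4, 32, d_model)   # (batch, query positions, d_model)
k = torch.randn(4, 32, d_model)   # (batch, key positions, d_model)
loss = listwise_attention_matching_loss(q, k)
loss.backward()                   # trains only the projection matrices
```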
Supplementary Material: zip
Primary Area: foundation or frontier models, including LLMs
Submission Number: 21616