Keywords: preference optimization, dimensionality reduction, KV cache optimization
TL;DR: We introduce DimPO, a listwise preference optimization loss for dimensionality reduction of attention that reduces KV cache memory in LLMs by 10–15% with minimal performance loss.
Abstract: Large Language Models (LLMs) require increasing memory and computational resources to process long contexts, motivating methods that reduce the dimensionality of key and query representations while preserving the resulting attention patterns. A previous triplet-based method for LSH-style projections optimizes only local pairwise relations and therefore struggles to capture the global ranking structure inherent to attention. We reframe dimensionality reduction of attention as a listwise learning problem based on preference optimization and ranking losses. Our analysis shows that listwise approaches better preserve full attention distributions in low-dimensional spaces, reducing KL divergence to less than 25% of the value achieved by triplet-based and other pairwise methods. Building on this, we directly apply listwise-learned projections to the attention layers of LLaMA3-[1B, 3B, 8B], Qwen2.5-7B, and Qwen3-4B instruct models. On general benchmark tasks, listwise projection reduces KV cache memory by 10–15% while maintaining 95% of model performance. On long-context tasks, larger models using this projection exhibit only a marginal performance drop and remain compatible with existing KV-cache reduction techniques such as SnapKV, improving throughput without appreciable accuracy degradation. These results establish listwise learning as a principled and effective approach for dimensionality reduction of attention, directly applicable to attention layers at inference.
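To make the core idea concrete, below is a minimal sketch of one plausible listwise objective for attention dimensionality reduction: a ListNet-style KL loss between the full-dimensional attention distribution (the "teacher") and the distribution induced by learned low-rank query/key projections (the "student"). This is an illustration under stated assumptions, not the paper's DimPO implementation; the module and parameter names (`ListwiseProjector`, `d_proj`, `temperature`) are hypothetical.

```python
# Minimal sketch of a listwise projection objective for attention, assuming a
# ListNet-style KL loss over per-query attention distributions. Names are
# illustrative, not the paper's actual API.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ListwiseProjector(nn.Module):
    """Learned low-rank projections for queries and keys (hypothetical module)."""
    def __init__(self, d_model: int, d_proj: int):
        super().__init__()
        self.proj_q = nn.Linear(d_model, d_proj, bias=False)
        self.proj_k = nn.Linear(d_model, d_proj, bias=False)

    def forward(self, q, k):
        return self.proj_q(q), self.proj_k(k)

def listwise_kl_loss(q, k, projector, temperature: float = 1.0):
    """KL divergence between the full-dimensional attention distribution and
    the distribution induced by the projected queries/keys."""
    d_model = q.size(-1)
    # Teacher: attention over all keys, computed in the original space.
    teacher_logits = q @ k.transpose(-2, -1) / d_model**0.5
    teacher = F.softmax(teacher_logits, dim=-1)
    # Student: attention computed from the low-dimensional projections.
    q_lo, k_lo = projector(q, k)
    d_proj = q_lo.size(-1)
    student_logits = q_lo @ k_lo.transpose(-2, -1) / d_proj**0.5
    student_log = F.log_softmax(student_logits / temperature, dim=-1)
    # Listwise objective: match whole per-query distributions, rather than
    # only local pairwise/triplet relations between keys.
    return F.kl_div(student_log, teacher, reduction="batchmean")

# Toy usage: batch of 2 sequences, 16 tokens, d_model=64 projected to d_proj=32.
if __name__ == "__main__":
    q = torch.randn(2, 16, 64)
    k = torch.randn(2, 16, 64)
    projector = ListwiseProjector(64, 32)
    loss = listwise_kl_loss(q, k, projector)
    loss.backward()
    print(f"listwise KL loss: {loss.item():.4f}")
```

The design choice this sketch highlights is the one the abstract argues for: the target is the entire softmax distribution over keys, so gradients account for the global ranking structure of attention, unlike triplet losses that only compare isolated key pairs.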
Supplementary Material: zip
Primary Area: foundation or frontier models, including LLMs
Submission Number: 21616