TL;DR: We propose a simple method to reduce memory usage and decoding latency while maintaining performance in long-context LLM inference.
Abstract: Large language models (LLMs) have demonstrated strong capabilities in handling long-context tasks, but processing such long contexts remains challenging due to the substantial memory requirements and inference latency. In this work, we discover that certain attention heads exhibit sequential consistency in their attention patterns, which can be persistently identified using a coefficient-of-variation-based algorithm. Inspired by this observation, we propose CateKV, a hybrid KV cache method that retains only critical token information for consistent heads, thereby reducing KV cache size and computational overhead, while preserving the majority of KV pairs in adaptive heads to ensure high accuracy. We analyze the unique characteristics of our algorithm and show that it can be combined with existing acceleration methods. Comprehensive evaluations on long-context benchmarks show that, while maintaining accuracy comparable to full attention, CateKV reduces memory usage by up to $2.72\times$ and accelerates decoding by $2.18\times$ for single-sample inputs, and boosts throughput by $3.96\times$ in batch scenarios.
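To make the head-classification idea concrete, below is a minimal sketch of how a coefficient-of-variation (CV) statistic could separate "consistent" heads (candidates for KV cache compression) from "adaptive" heads (which keep most KV pairs). This is an illustration under assumptions, not the paper's implementation: the exact statistic, the threshold value, and the function names are hypothetical.

```python
# Minimal sketch (assumptions, not CateKV's actual implementation): classify attention
# heads by the coefficient of variation of their attention patterns across query positions.
# Heads with low CV ("consistent") would keep only critical tokens in the KV cache;
# the remaining "adaptive" heads would keep most KV pairs.
import torch

def classify_heads(attn_scores: torch.Tensor, cv_threshold: float = 0.5) -> torch.Tensor:
    """attn_scores: [num_heads, num_queries, num_keys] post-softmax attention weights.
    Returns a boolean mask over heads; True marks a head treated as consistent."""
    mean_per_key = attn_scores.mean(dim=1)                   # [H, K] avg weight each key receives
    std_per_key = attn_scores.std(dim=1)                     # [H, K] variability across queries
    cv = (std_per_key / (mean_per_key + 1e-8)).mean(dim=-1)  # [H] average CV per head
    return cv < cv_threshold                                 # low variation -> consistent head

# Toy usage: 8 heads, 16 query positions, 128 keys of random attention weights.
scores = torch.softmax(torch.randn(8, 16, 128), dim=-1)
mask = classify_heads(scores)
print("consistent heads:", mask.nonzero(as_tuple=True)[0].tolist())
```

In a hybrid cache along these lines, the mask would decide per head whether to evict all but the critical tokens or to retain the full KV sequence, which is how the memory and latency savings in the abstract would arise.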
Lay Summary: Modern language models excel at understanding long texts but often require significant memory and time to process them. We discovered that certain parts of these models consistently focus on specific information, which inspired us to develop CateKV—a method that retains only the most important information from these parts while preserving more detail where necessary. This approach reduces memory usage and speeds up processing without sacrificing accuracy. Our experiments show that CateKV can reduce memory consumption by nearly three times and double the speed for single-sample inputs, while boosting throughput for batch inputs by almost four times. This makes handling long documents more efficient and practical.
Primary Area: Deep Learning->Large Language Models
Keywords: Large Language Models, Acceleration, Long context
Submission Number: 3204