KVSink: Understanding and Enhancing the Preservation of Attention Sinks in KV Cache Quantization for LLMs
Keywords: quantization, kv cache, transformer, llm, attention
TL;DR: Understanding and Enhancing the Preservation of Attention Sinks in KV Cache Quantization for LLMs
Abstract: Key-Value (KV) cache quantization has become a widely adopted optimization technique for efficient large language model (LLM) inference, reducing KV cache memory usage and mitigating memory-bound constraints.
Recent studies have emphasized the importance of keeping the KVs of the first few tokens at their original precision so that attention sinks are protected.
While this approach has proven effective in mitigating performance degradation, its underlying principles remain insufficiently understood.
Moreover, it fails to address the recent discovery that attention sinks can emerge beyond the initial token positions.
In this work, we elucidate the underlying mechanisms of attention sinks during inference by examining their role in the cross-layer evolution of extreme activation outliers.
Additionally, we provide a comprehensive analysis of the interplay between attention sinks and KV cache quantization.
Based on our enhanced understanding, we introduce KVSink, a plug-and-play method that effectively predicts sink tokens with negligible overhead, enabling more thorough preservation.
Extensive experiments demonstrate that KVSink outperforms the existing Preserve-First-N (PFN) strategy, offering more effective preservation of attention sinks during KV cache quantization.
Moreover, when applied to the well-established KVQuant method, KVSink further improves perplexity (PPL) and reduces reliance on 16-bit numerical outliers.
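For illustration only, the following minimal sketch (not the authors' implementation) contrasts the Preserve-First-N strategy with sink-aware preservation of predicted sink positions. The per-token asymmetric fake-quantization, all function names, and the example indices are assumptions introduced here for clarity.

```python
# Minimal sketch, assuming a per-token asymmetric fake-quantization scheme.
# Function names and indices are illustrative, not from the paper.
import torch

def fake_quant_per_token(x: torch.Tensor, n_bits: int = 4) -> torch.Tensor:
    """Round-trip (quantize then dequantize) each token vector to n_bits."""
    qmax = 2 ** n_bits - 1
    lo = x.amin(dim=-1, keepdim=True)
    hi = x.amax(dim=-1, keepdim=True)
    scale = (hi - lo).clamp(min=1e-8) / qmax
    q = torch.clamp(torch.round((x - lo) / scale), 0, qmax)
    return q * scale + lo

def quantize_kv_with_preservation(kv: torch.Tensor,
                                  preserve_idx: torch.Tensor,
                                  n_bits: int = 4) -> torch.Tensor:
    """kv: [seq_len, head_dim]; preserve_idx: token positions kept at full precision."""
    out = fake_quant_per_token(kv, n_bits)
    out[preserve_idx] = kv[preserve_idx]  # leave preserved tokens untouched
    return out

kv = torch.randn(128, 64, dtype=torch.float16)

# Preserve-First-N (PFN): always protect the first N positions.
pfn_out = quantize_kv_with_preservation(kv, torch.arange(4))

# Sink-aware preservation (KVSink-style): protect positions predicted to be
# attention sinks, which may lie beyond the first few tokens (indices made up).
sink_idx = torch.tensor([0, 1, 37, 90])
sink_out = quantize_kv_with_preservation(kv, sink_idx)
```

The design point of the contrast is that PFN fixes the preserved set ahead of time, whereas a sink predictor selects the preserved positions per sequence, covering sinks that emerge beyond the initial tokens.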
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the COLM Code of Ethics on https://colmweb.org/CoE.html
Author Guide: I certify that this submission complies with the submission instructions as described on https://colmweb.org/AuthorGuide.html
Award Nomination: true
Submission Number: 22