Keywords: LLM, KV cache compression, Reasoning models
Abstract: Reasoning models like OpenAI-o1 and DeepSeek-R1 have demonstrated strong capabilities in complex tasks such as mathematical reasoning and code generation.
However, this leap in performance is achieved by generating a significantly greater number of output tokens, which dramatically increases deployment costs.
The generation of extremely long sequences necessitates a longer KV cache, which in turn results in a substantial memory footprint and severe bandwidth pressure during attention computation.
While there are numerous techniques to optimize the KV cache, they are predominantly designed for long-input, short-output scenarios and are ineffective for the long-output nature of these reasoning models.
In particular, the high computational cost of their importance estimation is severely exacerbated in long-output scenarios, where the growing context must be continuously re-evaluated.
To overcome this challenge, we introduce LongFlow, a novel KV cache compression method that employs an efficient importance estimation metric derived from an intermediate result of the attention computation, using only the current query. This design requires no auxiliary storage and adds negligible computational overhead.
Furthermore, we implement a custom kernel that integrates Flash-Attention, importance estimation, and token eviction into a single, highly optimized operator to enhance system-level efficiency.
Extensive experiments demonstrate that our method can achieve an 11.8x increase in throughput with 80% compression of the KV cache while incurring negligible loss in model accuracy.
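To make the idea concrete, below is a minimal PyTorch sketch of query-only KV cache eviction during decoding. It assumes the "intermediate result" is the current query's attention weights over cached keys and that the least-attended tokens are evicted when a budget is exceeded; the names (attend_and_evict, kv_budget) and the head-averaging of scores are illustrative assumptions, not the paper's fused-kernel implementation.

```python
import torch

def attend_and_evict(q, k_cache, v_cache, kv_budget):
    """Sketch of one decoding step with query-only importance estimation.

    q: (heads, d) current query; k_cache, v_cache: (heads, seq, d).
    Returns the attention output and a KV cache trimmed to <= kv_budget tokens.
    """
    d = q.shape[-1]
    # Standard attention for the current step.
    scores = torch.einsum("hd,hsd->hs", q, k_cache) / d ** 0.5   # (heads, seq)
    weights = torch.softmax(scores, dim=-1)                      # intermediate reused below
    out = torch.einsum("hs,hsd->hd", weights, v_cache)           # (heads, d)

    # Importance estimate (assumption): attention mass each cached token
    # receives from the current query, averaged over heads.
    importance = weights.mean(dim=0)                             # (seq,)
    if importance.shape[0] > kv_budget:
        keep = torch.topk(importance, kv_budget).indices.sort().values
        k_cache = k_cache[:, keep]
        v_cache = v_cache[:, keep]
    return out, k_cache, v_cache
```

Because the importance signal is a by-product of the attention weights already computed for the current query, no extra statistics need to be stored across steps; in the actual system this estimation and eviction would be fused with Flash-Attention into a single kernel rather than run as separate PyTorch ops.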
Primary Area: foundation or frontier models, including LLMs
Submission Number: 16469