Zipage: Maintain High Request Concurrency for LLM Reasoning through Compressed PagedAttention

ACL ARR 2026 January Submission4147 Authors

05 Jan 2026 (modified: 20 Mar 2026) · ACL ARR 2026 January Submission · CC BY 4.0
Keywords: LLM Server Engine, KV Cache Eviction, LLM Efficiency
Abstract: As reasoning becomes the dominant generative paradigm for large language models, the memory bottleneck caused by the KV cache during inference has become a critical factor limiting high-concurrency serving. Although existing KV cache eviction methods address this memory issue, most are impractical for industrial-grade deployment. This paper introduces Compressed PagedAttention, a method that combines token-wise KV cache eviction with PagedAttention. We propose a comprehensive scheduling strategy and add support for prefix caching and asynchronous compression for Compressed PagedAttention. On top of this, we develop a high-concurrency inference engine, Zipage. On large-scale mathematical reasoning tasks, Zipage retains around 95\% of the performance of full-KV inference engines while delivering over a 2.1$\times$ speedup.
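The abstract's core idea, combining token-wise KV cache eviction with a paged KV layout, can be illustrated with a minimal sketch. This is not the paper's implementation: the block size, the score-based keep heuristic, and all function names are illustrative assumptions; real systems operate on GPU page tables rather than NumPy arrays.

```python
# Minimal sketch (assumed, not the paper's code): evict low-scoring tokens
# from the KV cache, then repack the survivors into fixed-size pages in the
# spirit of PagedAttention. BLOCK and the scoring heuristic are assumptions.
import numpy as np

BLOCK = 4  # tokens per KV page (illustrative block size)

def evict_and_repage(keys, values, scores, keep_ratio=0.5):
    """Keep the top `keep_ratio` fraction of tokens by attention score,
    preserving their original order, then repack them into pages."""
    n = keys.shape[0]
    k = max(1, int(n * keep_ratio))
    keep = np.sort(np.argsort(scores)[-k:])  # top-k indices, original order
    ks, vs = keys[keep], values[keep]
    # Survivors are compacted into dense fixed-size pages.
    pages = [(ks[i:i + BLOCK], vs[i:i + BLOCK]) for i in range(0, k, BLOCK)]
    return keep, pages

rng = np.random.default_rng(0)
n, d = 10, 8
keys = rng.standard_normal((n, d))
values = rng.standard_normal((n, d))
scores = rng.random(n)  # stand-in for accumulated attention scores
kept, pages = evict_and_repage(keys, values, scores, keep_ratio=0.5)
```

Because eviction compacts the cache into fewer full pages, freed pages can be returned to the block allocator and reused by other requests, which is what sustains higher concurrency.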
Paper Type: Long
Research Area: LLM Efficiency
Research Area Keywords: LLM Efficiency; NLP in resource-constrained settings;
Contribution Types: Approaches to low-resource settings, Approaches for low-compute settings (efficiency), Publicly available software and/or pre-trained models
Languages Studied: English
Submission Number: 4147