Keywords: Efficient Computing, ML System, KV Cache Management, Mixture of Experts, Routing, Cache Compression, Cache Eviction
TL;DR: We develop PiKV, an efficient KV cache management system for MoE architectures.
Abstract: As large language models continue to grow in both size and context length, the memory and communication cost of key-value (KV) cache storage has become a major bottleneck in multi-GPU and multi-node inference. While MoE-based architectures sparsify computation across experts, the corresponding KV caches remain dense and globally synchronized, resulting in significant overhead.
We introduce \textbf{PiKV}, a parallel and distributed KV cache serving framework tailored for MoE architectures. PiKV leverages \textit{expert-sharded KV storage} to partition caches across GPUs, \textit{PiKV routing} to reduce token-to-KV access, and \textit{PiKV scheduling} to adaptively retain query-relevant entries. To further reduce memory usage, PiKV integrates \textit{PiKV compression} modules into the caching pipeline for acceleration.
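To make these components concrete, the sketch below illustrates the general ideas in PyTorch: a per-expert (per-GPU) KV shard, top-1 routing so each token's KV entry touches only its expert's shard, and query-aware eviction when a shard is full. This is a minimal illustration under our own assumptions; the class and method names (\texttt{ExpertShardedKVCache}, \texttt{route\_token}, \texttt{insert}) are hypothetical and do not reflect the actual PiKV API.

```python
# Minimal sketch of expert-sharded KV caching with routing and query-aware
# eviction. All names here are illustrative, not the real PiKV interfaces.
import torch

class ExpertShardedKVCache:
    """Toy expert-sharded KV cache: one shard per expert (stand-in for one per GPU)."""

    def __init__(self, num_experts: int, capacity_per_shard: int, head_dim: int):
        self.num_experts = num_experts
        self.capacity = capacity_per_shard
        self.head_dim = head_dim
        # Each shard holds (key, value) tensors for tokens routed to that expert.
        self.shards = [{"k": [], "v": []} for _ in range(num_experts)]

    def route_token(self, token_hidden: torch.Tensor, gate_weight: torch.Tensor) -> int:
        # Top-1 routing: pick the expert with the highest gate score, so the
        # token's KV entry is stored and read from only that expert's shard.
        scores = gate_weight @ token_hidden            # [num_experts]
        return int(torch.argmax(scores).item())

    def insert(self, expert_id: int, k: torch.Tensor, v: torch.Tensor,
               query: torch.Tensor) -> None:
        shard = self.shards[expert_id]
        if len(shard["k"]) >= self.capacity:
            self._evict_least_relevant(shard, query)
        shard["k"].append(k)
        shard["v"].append(v)

    def _evict_least_relevant(self, shard, query: torch.Tensor) -> None:
        # Query-aware eviction: drop the cached key with the lowest score
        # against the current query (a stand-in for relevance-based scheduling).
        keys = torch.stack(shard["k"])                 # [n, head_dim]
        scores = keys @ query                          # [n]
        victim = int(torch.argmin(scores).item())
        shard["k"].pop(victim)
        shard["v"].pop(victim)


# Usage: route a stream of tokens, caching each KV pair only on its expert's shard.
torch.manual_seed(0)
num_experts, head_dim = 4, 16
cache = ExpertShardedKVCache(num_experts, capacity_per_shard=8, head_dim=head_dim)
gate_weight = torch.randn(num_experts, head_dim)       # toy gating matrix

for _ in range(32):
    hidden = torch.randn(head_dim)
    query = torch.randn(head_dim)
    expert = cache.route_token(hidden, gate_weight)
    cache.insert(expert, k=hidden.clone(), v=torch.randn(head_dim), query=query)

print([len(s["k"]) for s in cache.shards])             # per-shard cache occupancy
```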
PiKV is publicly available as an open-source software library at {https://github.com/NoakLiu/PiKV}. Experiment details are documented at {https://github.com/NoakLiu/PiKV/blob/main/downstream_tasks/README.md}. PiKV has also been integrated with NVIDIA kvpress for acceleration; see {https://github.com/NoakLiu/PiKVpress} for details. PiKV remains an actively developed project, aiming to become a comprehensive KV cache management system for MoE architectures.
Submission Number: 146