HydraCache: LLM Inference Prefill Parallelization through Distributed Cache Blending

Published: 22 Nov 2025, Last Modified: 07 May 2026. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC’25). CC BY 4.0
Abstract: The prefill phase of Large Language Model (LLM) inference, where the input prompt is processed to generate a Key-Value (KV) cache, is a critical latency bottleneck for long input sequences. Existing serving architectures face a trade-off: Data Parallelism (DP) offers flexibility but cannot accelerate a single long prompt, while Tensor Parallelism (TP) parallelizes prefill at the cost of rigid resource allocation and constant communication overhead at every layer. We introduce HydraCache, a system that resolves this trade-off by enabling a cluster of independent, data-parallel model replicas to collaborate on demand to parallelize the prefill of a single long prompt. Our core contribution is DistBlendAttention, a lightweight mechanism that fuses distributed KV caches with minimal communication, avoiding the prohibitive overheads of both TP and traditional Sequence Parallelism. Our evaluation shows that HydraCache reduces Time-To-First-Token (TTFT) by up to 7x and enables flexible, SLO-aware serving.
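
Below is a minimal NumPy sketch of the general idea the abstract describes: a long prompt is split across independent data-parallel replicas, each replica computes the K/V projections for its own chunk with the same model weights, and a coordinator merges the per-chunk caches into one contiguous KV cache before decoding. The helper names (chunk_kv, blend_caches, decode_step), the single-layer/single-head setting, and the plain concatenation merge are illustrative assumptions, not the paper's DistBlendAttention; in a real multi-layer model, per-chunk hidden states at deeper layers diverge from their full-context values, and correcting for that divergence with minimal communication is exactly what a cache-fusion mechanism must handle.

```python
# Illustrative sketch only: single layer, single head, plain concatenation merge.
import numpy as np

d_model = 64
rng = np.random.default_rng(0)
# Projection weights are identical on every replica (same model weights).
W_k = rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)
W_v = rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)
W_q = rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)

def chunk_kv(chunk_hidden):
    """What one data-parallel replica computes for its prompt chunk:
    the K and V projections of its local hidden states only."""
    return chunk_hidden @ W_k, chunk_hidden @ W_v

def blend_caches(chunks):
    """Coordinator-side merge: concatenate per-chunk K/V along the sequence
    axis so decoding sees one contiguous cache (hypothetical placement of
    the cache-fusion step)."""
    ks, vs = zip(*chunks)
    return np.concatenate(ks, axis=0), np.concatenate(vs, axis=0)

def decode_step(query_hidden, K, V):
    """One decode step attending over the merged cache."""
    q = query_hidden @ W_q
    scores = (K @ q) / np.sqrt(d_model)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V

# Example: a 1024-token prompt split across 4 replicas of 256 tokens each.
prompt_hidden = rng.standard_normal((1024, d_model))
per_replica_chunks = np.split(prompt_hidden, 4, axis=0)
K, V = blend_caches([chunk_kv(h) for h in per_replica_chunks])
out = decode_step(rng.standard_normal(d_model), K, V)
print(K.shape, V.shape, out.shape)  # (1024, 64) (1024, 64) (64,)
```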