Abstract: The prefill phase of Large Language Model (LLM) inference, where
the input prompt is processed to generate a Key-Value (KV) cache,
is a critical latency bottleneck for long input sequences. Existing serving
architectures face a trade-off: Data Parallelism (DP) offers flexibility
but cannot accelerate a single long prompt, while Tensor Parallelism
(TP) parallelizes prefill but at the cost of rigid resource allocation
and constant communication overhead at each layer. We introduce
HydraCache, a system that resolves this trade-off by enabling a
cluster of independent, data-parallel model replicas to collaborate
on-demand to parallelize the prefill of a single long prompt. Our core
contribution is DistBlendAttention, a lightweight mechanism that
fuses distributed KV caches with minimal communication, avoiding
the prohibitive overheads of both TP and traditional Sequence
Parallelism. Our evaluation shows that HydraCache reduces
Time-To-First-Token (TTFT) by up to 7x for long-prompt requests and
enables flexible, SLO-aware serving.