Keywords: Semantic Cache, Key-Value Cache, Large Language Model
Abstract: Serving Large Language Models (LLMs) for multi-turn conversations suffers from a key inefficiency: semantically similar queries across different user sessions trigger redundant computation and duplicate memory-intensive Key-Value (KV) caches. Existing optimizations such as prefix caching overlook semantic similarity, while typical semantic caches either ignore conversational context or are not integrated with low-level KV cache management.
We propose SmartCache, a system-algorithm co-design framework that tackles this inefficiency by exploiting semantic query similarity across sessions. SmartCache uses a Semantic Forest structure to hierarchically index conversational turns, enabling efficient retrieval and reuse of responses only when both the query semantics and the conversational context match.
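To make the lookup behavior concrete, the following is a minimal Python sketch of a Semantic Forest traversal, assuming cosine similarity over precomputed per-turn query embeddings; the names (`TurnNode`, `lookup`, `SIM_THRESHOLD`) and the threshold value are illustrative assumptions, not the paper's actual API.

```python
# Hypothetical sketch: hierarchical turn index with context-aware reuse.
from dataclasses import dataclass, field
import numpy as np

SIM_THRESHOLD = 0.9  # assumed similarity cutoff for a semantic hit

@dataclass
class TurnNode:
    query_emb: np.ndarray   # embedding of this turn's query
    response: str           # cached response for this turn
    children: list = field(default_factory=list)  # follow-up turns

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def lookup(roots: list, turn_embs: list):
    """Walk the forest turn by turn; reuse a response only if every
    turn in the conversational context matches semantically."""
    level, node = roots, None
    for emb in turn_embs:
        node = max(level, key=lambda n: cosine(n.query_emb, emb), default=None)
        if node is None or cosine(node.query_emb, emb) < SIM_THRESHOLD:
            return None  # context diverges: fall back to full prefill
        level = node.children
    return node.response if node else None
```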
To maintain accuracy across topic shifts, SmartCache reuses the LLM's internal attention scores, computed during standard prefill, to detect context changes dynamically with minimal computational overhead. Importantly, this semantic understanding is co-designed with the memory system: a novel two-level mapping enables transparent cross-session KV cache sharing for semantically equivalent states, complemented by a semantics-aware eviction policy. Together, these mechanisms substantially reduce redundant computation and improve GPU memory utilization.
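One plausible form of such a detector is sketched below, assuming access to prefill attention weights of shape (heads, seq_len, seq_len); `detect_topic_shift` and `SHIFT_THRESHOLD` are hypothetical names and values for illustration, not the paper's exact method.

```python
# Hypothetical sketch: flag a topic shift when the new query's tokens
# place little attention mass on the cached conversational context.
import numpy as np

SHIFT_THRESHOLD = 0.2  # assumed minimum attention mass on prior context

def detect_topic_shift(attn: np.ndarray, context_len: int) -> bool:
    """attn: attention weights from the standard prefill pass,
    shape (heads, seq_len, seq_len); rows >= context_len are the
    new query's positions. Returns True on a suspected topic shift."""
    # Attention mass that new-query positions place on context tokens,
    # averaged over heads and query positions.
    mass_on_context = attn[:, context_len:, :context_len].sum(axis=-1).mean()
    return mass_on_context < SHIFT_THRESHOLD
```

Because these scores are already produced by the prefill pass, the check adds only a reduction over existing tensors rather than any extra model computation.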
Our evaluation demonstrates SmartCache's effectiveness across multiple benchmarks. On the CoQA and SQuAD datasets, SmartCache reduces KV cache memory usage by up to $59.1\%$ compared to prefix caching and $56.0\%$ over semantic caching, while cutting Time-to-First-Token (TTFT) by $78.0\%$ and $71.7\%$, respectively. It also improves answer quality, achieving $39.9\%$ higher F1 and $39.1\%$ higher ROUGE-L for Qwen-2.5-1.5B on CoQA. The Semantic-aware Tiered Eviction Policy (STEP) outperforms LRU/LFU by $29.9\%$ in reuse distance under skewed workloads.
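As a rough illustration of how a semantics-aware tiered policy like STEP might rank victims, the sketch below assumes each cache entry tracks recency, reuse frequency, and its number of semantic descendants in the forest; the scoring weights and all names are hypothetical, not STEP's actual parameters.

```python
# Hypothetical sketch: eviction score blending recency, frequency,
# and semantic fan-out, so entries anchoring many related turns survive.
import time
from dataclasses import dataclass

@dataclass
class CacheEntry:
    last_access: float   # timestamp of most recent reuse
    hits: int            # how often this KV state was reused
    descendants: int     # dependent turns in the Semantic Forest

def step_score(e: CacheEntry, now: float) -> float:
    recency = 1.0 / (1.0 + now - e.last_access)
    # Assumed weights; a real policy would tune or tier these.
    return 0.4 * recency + 0.3 * e.hits + 0.3 * e.descendants

def pick_victim(entries: list) -> CacheEntry:
    now = time.time()
    return min(entries, key=lambda e: step_score(e, now))
```

Unlike pure LRU/LFU, such a score lets an old, rarely hit entry survive if many semantically related turns still depend on it, which is the intuition behind the reuse-distance gains reported above.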
Primary Area: Infrastructure (e.g., libraries, improved implementation and scalability, distributed solutions)
Submission Number: 2494