Keywords: Generative Caching, Hit Rate Optimization, Repeatable Workflows, LLM Serving Systems
Abstract: Large Language Models (LLMs) are increasingly used to plan, reason, and execute tasks across diverse scenarios. In use cases such as repeatable workflows and agentic settings, prompts for recurring tasks are often reused with minor variations while retaining a similar structure. This opens up opportunities for caching. However, exact prompt matching fails on such structurally similar prompts, while semantic caching may produce incorrect responses by ignoring critical differences. To address this, we introduce GenCache, a generative cache that produces variation-aware responses for structurally similar prompts. GenCache identifies reusable response patterns across similar prompt structures and synthesizes customized outputs for new requests. We show that GenCache achieves an 83\% cache hit rate while producing minimal incorrect hits on datasets without prompt repetition. In agentic workflows, it improves the cache hit rate by $\sim$20\% and reduces end-to-end execution latency by $\sim$34\% compared to standard prompt matching.
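To make the caching idea concrete, below is a minimal, hypothetical Python sketch of a variation-aware cache. It assumes a toy structural normalizer (quoted strings and numbers are treated as variable slots) and stores a response *template* per prompt structure; the class and method names (`GenCacheSketch`, `lookup`, `store`) are illustrative assumptions, not the paper's actual design or API.

```python
import re
from typing import Optional


class GenCacheSketch:
    """Toy variation-aware generative cache (illustrative only).

    Prompts sharing a structure (same text, different slot values)
    map to one entry holding a response template; a hit
    re-instantiates the template with the new prompt's slot values.
    """

    # Assumption: quoted strings and integers act as variable slots.
    _SLOT = re.compile(r'"[^"]*"|\d+')

    def __init__(self) -> None:
        # Maps structural key -> response template with {i} placeholders.
        self._entries: dict[str, str] = {}

    def _structure(self, prompt: str) -> tuple[str, list[str]]:
        """Split a prompt into a structural key and its slot values."""
        slots = self._SLOT.findall(prompt)
        key = self._SLOT.sub("<slot>", prompt)
        return key, slots

    def lookup(self, prompt: str) -> Optional[str]:
        """Return a synthesized response on a structural hit, else None."""
        key, slots = self._structure(prompt)
        template = self._entries.get(key)
        if template is None:
            return None  # cache miss: caller falls back to the LLM
        # Synthesize a variation-aware response by substituting the
        # new prompt's slot values into the cached template.
        return template.format(*slots)

    def store(self, prompt: str, response: str) -> None:
        """Cache a response, templated against the prompt's slot values."""
        key, slots = self._structure(prompt)
        template = response
        for i, value in enumerate(slots):
            # Replace concrete slot values with positional placeholders.
            template = template.replace(value, "{%d}" % i)
        self._entries[key] = template
```

Under these assumptions, `store('Summarize "doc A"', ...)` followed by `lookup('Summarize "doc B"')` would hit the same structural entry and substitute the new document name into the cached response; exact prompt matching would miss here, and a purely semantic match could ignore the changed document entirely.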
Supplementary Material: zip
Primary Area: Infrastructure (e.g., libraries, improved implementation and scalability, distributed solutions)
Submission Number: 25211