CacheSaver: A Modular Framework for Efficient, Affordable, and Reproducible LLM Inference

Published: 11 Jun 2025 · Last Modified: 10 Jul 2025 · ES-FoMo III · CC BY 4.0
Keywords: large language models, efficiency, reproducibility, client-side caching, cost-efficiency, carbon footprint, test-time compute
TL;DR: CacheSaver is a plug-and-play framework for high-level LLM inference optimization, cutting costs by ~25% and CO₂ by ~35%, with up to 60% savings in benchmarking and ablation tasks.
Abstract: Inference constitutes the majority of costs throughout the lifecycle of a large language model (LLM). While numerous LLM inference engines focusing primarily on low-level optimizations have been developed, there is a scarcity of non-intrusive client-side frameworks that perform high-level optimizations. In this paper, we introduce CacheSaver, a modular, plug-and-play, and asynchronous framework that facilitates high-level inference optimizations, integrating cleanly into existing systems without requiring changes to the end-user application logic or the underlying LLM. The key novelty is a namespace-aware list-valued cache that preserves the statistical integrity of LLM responses by generating independent and identically distributed (i.i.d.) responses within a namespace, while also ensuring reproducibility. Moreover, because it operates at a high level, CacheSaver supports both local and online models. We conduct extensive experiments with five representative state-of-the-art reasoning strategies, five diverse benchmark tasks, and three different LLMs. On average across all methods, tasks, and LLMs, CacheSaver reduces cost by approximately 25% and CO₂ emissions by approximately 35%. Notably, CacheSaver excels in practical machine learning scenarios such as benchmarking multiple methods or conducting ablation analyses of a specific method, achieving cost and carbon footprint reductions of approximately 60%. CacheSaver is publicly available at https://github.com/au-clan/cachesaver
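
To illustrate the core idea from the abstract, here is a minimal sketch of a namespace-aware list-valued cache. This is our own illustrative reconstruction, not CacheSaver's actual API: the names `ListValuedCache`, `request`, and `sample_fn` are hypothetical. The intuition it captures: each prompt maps to a growing list of sampled responses; within a namespace, repeated identical prompts consume distinct list entries (so responses stay i.i.d.), while a fresh namespace replays the list from index 0 (so runs are reproducible and previously paid-for samples are reused).

```python
import asyncio
import random
from collections import defaultdict


class ListValuedCache:
    """Minimal sketch of a namespace-aware list-valued cache.

    Hypothetical illustration of the idea described in the paper,
    not the actual CacheSaver implementation.
    """

    def __init__(self, sample_fn):
        self.sample_fn = sample_fn      # async fn: prompt -> one sampled response
        self.store = defaultdict(list)  # prompt -> [response, response, ...]
        self.cursor = defaultdict(int)  # (namespace, prompt) -> next list index

    async def request(self, namespace, prompt):
        # Each (namespace, prompt) pair advances its own cursor, so repeated
        # identical requests inside one namespace never reuse the same sample.
        i = self.cursor[(namespace, prompt)]
        self.cursor[(namespace, prompt)] += 1
        responses = self.store[prompt]
        if i < len(responses):
            return responses[i]         # cache hit: replay a stored sample
        response = await self.sample_fn(prompt)  # cache miss: query the LLM
        responses.append(response)
        return response


async def fake_llm(prompt):
    # Stand-in sampler; a real deployment would call an LLM API here.
    return f"{prompt} -> sample {random.random():.6f}"


async def main():
    cache = ListValuedCache(fake_llm)
    # Namespace "run1": three draws for the same prompt are three
    # distinct i.i.d. samples (three LLM calls).
    run1 = [await cache.request("run1", "2+2?") for _ in range(3)]
    # Namespace "run2" replays the same list from the start: zero new
    # LLM calls, and the run is bitwise reproducible.
    run2 = [await cache.request("run2", "2+2?") for _ in range(3)]
    assert run1 == run2


asyncio.run(main())
```

Under these assumptions, the second namespace costs nothing, which is where the reported ~60% savings in benchmarking and ablation workloads would come from: different methods or ablation variants re-issue largely overlapping prompts, and the cache serves them without re-querying the model.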
Submission Number: 159