Abstract: Large Language Models (LLMs) are powerful but computationally expensive, making them impractical for latency-sensitive or resource-constrained applications. This paper presents SLMCache, an adaptive caching framework that uses Small Language Models (SLMs) as semantic caches to reduce the frequency and cost of LLM invocations. Queries are first matched against a local vector store; if a semantically similar query is found, an SLM generates the response. Otherwise, the query is forwarded to the LLM, and its output is logged for future caching. The cache uses LRU and LFU eviction policies, and the SLM is periodically retrained using logged queries to expand its response coverage. Evaluated on the Bitext customer support dataset, SLMCache achieves up to 2.8× speedup and 10× lower GPU memory usage compared to LLM-only baselines, while maintaining high semantic fidelity. The framework is practical for edge deployment and significantly reduces the operational cost of LLM-based systems.
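To make the routing described in the abstract concrete, the sketch below shows one way the hit/miss flow could look: embed the query, compare it against cached query embeddings, answer with the SLM on a sufficiently similar hit, and otherwise call the LLM, log the pair for retraining, and insert the query under LRU eviction. This is a minimal illustration under assumed interfaces, not the paper's released implementation; the class name, the 0.85 similarity threshold, and the embed/slm/llm callables are all illustrative assumptions.

```python
# Minimal sketch of the cache-routing logic (illustrative, not SLMCache itself).
# The embedder, SLM, and LLM are supplied as callables; only numpy is required.
from collections import OrderedDict
import numpy as np

class SemanticCache:
    def __init__(self, embed, slm, llm, threshold=0.85, capacity=1024):
        self.embed = embed          # query text -> 1-D numpy embedding
        self.slm = slm              # query text -> response from the small model
        self.llm = llm              # query text -> response from the large model
        self.threshold = threshold  # cosine-similarity cutoff for a cache hit (assumed value)
        self.capacity = capacity
        self.store = OrderedDict()  # query text -> unit-norm embedding, in LRU order
        self.miss_log = []          # (query, LLM response) pairs kept for periodic SLM retraining

    def _lookup(self, vec):
        """Return the most similar cached query and its cosine similarity."""
        best_q, best_sim = None, -1.0
        for q, v in self.store.items():
            sim = float(vec @ v)
            if sim > best_sim:
                best_q, best_sim = q, sim
        return best_q, best_sim

    def query(self, text):
        vec = self.embed(text)
        vec = vec / np.linalg.norm(vec)
        hit, sim = self._lookup(vec)
        if hit is not None and sim >= self.threshold:
            self.store.move_to_end(hit)      # refresh LRU position on a hit
            return self.slm(text)            # cache hit: answer with the SLM
        response = self.llm(text)            # cache miss: fall back to the LLM
        self.miss_log.append((text, response))
        if len(self.store) >= self.capacity:
            self.store.popitem(last=False)   # evict the least-recently-used entry
        self.store[text] = vec
        return response
```

The sketch shows only LRU eviction and a brute-force similarity scan; the paper's framework also supports LFU eviction and uses a vector store for the nearest-neighbor lookup.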
Paper Type: Long
Research Area: Efficient/Low-Resource Methods for NLP
Research Area Keywords: Efficient/Low-Resource Methods for NLP, Language Modeling, NLP Applications
Contribution Types: NLP engineering experiment, Approaches low compute settings-efficiency, Publicly available software and/or pre-trained models
Languages Studied: English
Submission Number: 98