Abstract: Large Language Models (LLMs) are powerful but computationally expensive, making them impractical for latency-sensitive or resource-constrained applications. This paper presents SLMCache, an adaptive caching framework that uses Small Language Models (SLMs) as semantic caches to reduce the frequency and cost of LLM invocations. Queries are first matched against a local vector store; if a semantically similar query is found, an SLM generates the response. Otherwise, the query is forwarded to the LLM, and its output is logged for future caching. The cache uses LRU and LFU eviction policies, and the SLM is periodically retrained using logged queries to expand its response coverage. Evaluated on the Bitext customer support dataset, SLMCache achieves up to 2.8× speedup and 10× lower GPU memory usage compared to LLM-only baselines, while maintaining high semantic fidelity. The framework is practical for edge deployment and significantly reduces the operational cost of LLM-based systems.
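To make the routing described in the abstract concrete, the sketch below shows one way the hit/miss flow could look: embed the query, compare it against cached query embeddings, answer with the SLM on a sufficiently similar hit, and otherwise call the LLM, log the pair for retraining, and insert the query under LRU eviction. This is a minimal illustration under assumed interfaces, not the paper's released implementation; the class name, the 0.85 similarity threshold, and the embed/slm/llm callables are all illustrative assumptions.

```python
# Minimal sketch of the cache-routing logic (illustrative, not SLMCache itself).
# The embedder, SLM, and LLM are supplied as callables; only numpy is required.
from collections import OrderedDict
import numpy as np

class SemanticCache:
    def __init__(self, embed, slm, llm, threshold=0.85, capacity=1024):
        self.embed = embed          # query text -> 1-D numpy embedding
        self.slm = slm              # query text -> response from the small model
        self.llm = llm              # query text -> response from the large model
        self.threshold = threshold  # cosine-similarity cutoff for a cache hit (assumed value)
        self.capacity = capacity
        self.store = OrderedDict()  # query text -> unit-norm embedding, in LRU order
        self.miss_log = []          # (query, LLM response) pairs kept for periodic SLM retraining

    def _lookup(self, vec):
        """Return the most similar cached query and its cosine similarity."""
        best_q, best_sim = None, -1.0
        for q, v in self.store.items():
            sim = float(vec @ v)
            if sim > best_sim:
                best_q, best_sim = q, sim
        return best_q, best_sim

    def query(self, text):
        vec = self.embed(text)
        vec = vec / np.linalg.norm(vec)
        hit, sim = self._lookup(vec)
        if hit is not None and sim >= self.threshold:
            self.store.move_to_end(hit)      # refresh LRU position on a hit
            return self.slm(text)            # cache hit: answer with the SLM
        response = self.llm(text)            # cache miss: fall back to the LLM
        self.miss_log.append((text, response))
        if len(self.store) >= self.capacity:
            self.store.popitem(last=False)   # evict the least-recently-used entry
        self.store[text] = vec
        return response
```

The sketch shows only LRU eviction and a brute-force similarity scan; the paper's framework also supports LFU eviction and uses a vector store for the nearest-neighbor lookup.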
Paper Type: Long
Research Area: Efficient/Low-Resource Methods for NLP
Research Area Keywords: Efficient/Low-Resource Methods for NLP, Language Modeling, NLP Applications
Contribution Types: NLP engineering experiment, Approaches low compute settings-efficiency, Publicly available software and/or pre-trained models
Languages Studied: English
Submission Number: 98