Selective Deferred Routing: Enabling Cost-Efficient Collaboration between Local SLMs and Remote LLMs
Keywords: LLM Routing, Cost-Efficiency, Local-Remote Collaboration
TL;DR: We achieve high task performance at limited monetary cost through collaboration between local SLMs and remote LLMs.
Abstract: The rapid advancement of large language models (LLMs) has led to remarkable performance across diverse domains such as question answering, creative writing, and programming, making them indispensable assistants in daily life and work. Currently, LLM services are primarily accessed in two ways: (i) paid access to cloud-hosted LLMs, which are powerful but incur nontrivial cost; and (ii) deployment of small language models (SLMs) on personal devices or small clusters, which are less powerful but sufficient for relatively simple tasks. To balance monetary cost against task performance, we propose Selective Deferred Routing, a paradigm that enables cost-efficient collaboration between local SLMs and remote LLMs. In this framework, a user request is first processed by the local SLM, which not only generates a preliminary response but also provides rich semantic representations of the request. A lightweight decider module then uses this information either to adopt the initial response or to route the request, in a single step, to the most suitable remote LLM for a higher-quality response. Extensive experiments on five LLMs and three datasets demonstrate that our approach consistently outperforms existing multi-LLM collaboration methods across different cost–performance trade-off preferences.
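The control flow described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify how the decider is built, so the linear quality predictors, the `cost_weight` trade-off parameter, and all names (`LocalOutput`, `RemoteLLM`, `route`) are assumptions made purely for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class LocalOutput:
    response: str           # preliminary answer from the local SLM
    embedding: List[float]  # semantic representation of the request


@dataclass
class RemoteLLM:
    name: str
    cost: float                     # monetary cost per request (hypothetical unit)
    generate: Callable[[str], str]  # call to the remote service
    weights: List[float]            # assumed linear predictor of response quality


def dot(a: List[float], b: List[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def route(
    request: str,
    local_slm: Callable[[str], LocalOutput],
    local_weights: List[float],
    remotes: List[RemoteLLM],
    cost_weight: float,
) -> Tuple[str, str]:
    """Return (chosen model name, final response)."""
    # Step 1: the local SLM always runs first.
    out = local_slm(request)

    # Step 2: the decider scores every option from the SLM's representation.
    # The local response is treated as free; remote options pay a cost penalty.
    best_score = dot(local_weights, out.embedding)
    best_remote = None
    for llm in remotes:
        score = dot(llm.weights, out.embedding) - cost_weight * llm.cost
        if score > best_score:
            best_score, best_remote = score, llm

    # Step 3: adopt the preliminary response, or defer once to a single remote LLM.
    if best_remote is None:
        return "local", out.response
    return best_remote.name, best_remote.generate(request)
```

In this sketch, a larger `cost_weight` favors keeping the local response, while `cost_weight = 0` selects whichever model has the highest predicted quality regardless of price.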
Primary Area: foundation or frontier models, including LLMs
Submission Number: 6798