TL;DR: We demonstrate that a simple communication protocol between an on-device LLM and a cloud-hosted LLM reduces cost by 5.7× while retaining 97.9% of cloud-only accuracy.
Abstract: We investigate an emerging setup in which a small, on-device language model (LM) with access to local data collaborates with a frontier, cloud-hosted LM to solve real-world tasks involving financial, medical, and scientific reasoning over long documents.
*Can a local-remote collaboration reduce cloud inference costs while preserving quality?*
First, we consider a naïve collaboration protocol, coined MINION, where the local and remote models simply chat back and forth. Because only the local model ingests the full context, this protocol reduces cloud costs by 30.4×, but recovers only 87% of the performance of the frontier model.
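To make the Minion protocol concrete, here is a minimal Python sketch of the back-and-forth loop described above. The `call_local_model` and `call_remote_model` helpers are hypothetical placeholders (not the released API); the key property is that only the local call ever receives the document, so the cloud bill never includes the long context.

```python
# Minimal sketch of the Minion chat protocol (illustrative, not the released code).
# Assumption: call_local_model runs an on-device LM; call_remote_model calls a
# frontier cloud LM that never sees the document, only the running transcript.

def call_local_model(prompt: str) -> str:
    """Placeholder for an on-device LM call (e.g., a small open-weights model)."""
    raise NotImplementedError

def call_remote_model(prompt: str) -> str:
    """Placeholder for a frontier cloud LM call; receives no document text."""
    raise NotImplementedError

def minion_chat(task: str, document: str, max_rounds: int = 5) -> str:
    """Local and remote models chat back and forth; only the local model reads the document."""
    transcript = f"Task: {task}"
    for _ in range(max_rounds):
        # Remote model asks a question or gives instructions based on the transcript alone.
        remote_msg = call_remote_model(transcript)
        if remote_msg.startswith("FINAL ANSWER:"):
            return remote_msg
        # Local model answers using the full document, keeping long context off the cloud bill.
        local_msg = call_local_model(f"{document}\n\n{remote_msg}")
        transcript += f"\nRemote: {remote_msg}\nLocal: {local_msg}"
    return call_remote_model(transcript + "\nProvide a FINAL ANSWER now.")
```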
We identify two key limitations of this protocol: the local model struggles to (1) follow the remote model's multi-step instructions and (2) reason over long contexts. Motivated by these observations, we propose MINIONS, a protocol in which the remote model decomposes the task into easier subtasks over shorter chunks of the document, which are executed locally in parallel. MINIONS reduces costs by 5.7× on average while recovering 97.9% of the remote-only performance. Our analysis reveals several key design choices that influence the trade-off between cost and performance in local-remote systems.
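The MinionS decompose-execute-aggregate loop can be sketched in the same style. The chunking scheme and single-subtask prompt below are simplifying assumptions for illustration, and the `call_local_model` / `call_remote_model` placeholders are reused from the previous sketch.

```python
# Minimal sketch of the MinionS protocol (illustrative assumptions, not the released code).
# Reuses the hypothetical call_local_model / call_remote_model placeholders above.

from concurrent.futures import ThreadPoolExecutor

def chunk_document(document: str, chunk_size: int = 4000) -> list[str]:
    """Split the long document into fixed-size character chunks (a simplification)."""
    return [document[i:i + chunk_size] for i in range(0, len(document), chunk_size)]

def minions(task: str, document: str) -> str:
    # 1) The remote model decomposes the task into an easier per-chunk subtask.
    subtask = call_remote_model(
        f"Write one simple instruction that, applied to a chunk of a document, helps solve: {task}"
    )
    # 2) The local model executes the subtask on every chunk in parallel.
    chunks = chunk_document(document)
    with ThreadPoolExecutor() as pool:
        local_outputs = list(pool.map(lambda c: call_local_model(f"{subtask}\n\n{c}"), chunks))
    # 3) The remote model aggregates the short local outputs into a final answer;
    #    it sees only these outputs, never the full document.
    return call_remote_model(f"Task: {task}\nLocal findings:\n" + "\n".join(local_outputs))
```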
Lay Summary: Many organizations wish to apply Large Language Models (LLMs) to large volumes of text. Current frontier models are extremely capable but expensive. Smaller language models are getting much better, and so is the hardware: the computers and phones that can now run them. On their own, however, these models lag behind the frontier models and struggle even with simple tasks. We propose a way for frontier models in the cloud to operate smaller language models on devices, cutting cloud costs while preserving accuracy. We present two protocols that achieve this; the stronger, called MinionS, cuts costs by 5.7× and maintains 97.9% of the frontier model's accuracy.
Link To Code: https://github.com/HazyResearch/Minions
Primary Area: Deep Learning->Large Language Models
Keywords: Local-remote collaboration, reasoning
Submission Number: 9367