Hybrid SLM and LLM for Edge-Cloud Collaborative Inference

Published: 2024, Last Modified: 25 Jan 2025 | EdgeFM@MobiSys 2024 | CC BY-SA 4.0
Abstract: Edge-Cloud collaboration for deep learning inference has been actively studied as a way to enhance inference performance by leveraging both Edge and Cloud resources. However, traditional Edge-Cloud collaboration based on model partitioning or confidence scores is not suitable in the era of LLMs (large language models), because of their autoregressive generation and their generality across diverse tasks. This paper proposes dynamic token-level Edge-Cloud collaboration for LLMs. An SLM (small language model) such as TinyLlama resides on the Edge device and interacts with the Cloud-side LLM at the token level during inference, approaching LLM quality at a controllable cost close to that of the SLM alone. Evaluation results show that our method achieves LLM-comparable quality on the GSM8K task while using only 25.8% of the LLM cost.
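The abstract does not spell out the routing policy, so the following is a minimal sketch of one plausible token-level scheme: confidence-threshold routing, where the Edge SLM proposes each token and escalates to the Cloud LLM only when its own probability for the proposal is low. The function names (`slm_next`, `llm_next`), the threshold policy, and all parameters are illustrative assumptions, not the paper's exact method.

```python
from typing import Callable, List, Tuple

# Hypothetical interfaces: each model returns a (token, probability) pair
# for the next token given the sequence so far. These stand in for a local
# SLM (e.g., TinyLlama) and a remote Cloud LLM API.
NextTokenFn = Callable[[List[str]], Tuple[str, float]]

def collaborative_decode(
    prompt_tokens: List[str],
    slm_next: NextTokenFn,        # runs on the Edge device
    llm_next: NextTokenFn,        # remote Cloud LLM, invoked per token
    conf_threshold: float = 0.7,  # routing knob: higher -> more Cloud calls
    max_new_tokens: int = 128,
    eos_token: str = "</s>",
) -> Tuple[List[str], float]:
    """Token-level Edge-Cloud decoding sketch (assumed policy).

    At each step the Edge SLM proposes the next token. If the SLM's
    probability for its proposal falls below `conf_threshold`, that
    single token is instead requested from the Cloud LLM. The threshold
    trades output quality against the fraction of tokens billed to the LLM.
    """
    tokens = list(prompt_tokens)
    llm_calls = 0
    for _ in range(max_new_tokens):
        token, prob = slm_next(tokens)
        if prob < conf_threshold:
            # Escalate only this token to the Cloud; decoding then
            # continues on the Edge from the extended sequence.
            token, _ = llm_next(tokens)
            llm_calls += 1
        tokens.append(token)
        if token == eos_token:
            break
    generated = len(tokens) - len(prompt_tokens)
    llm_cost_fraction = llm_calls / max(generated, 1)
    return tokens, llm_cost_fraction
```

Under a per-token cost model, `llm_cost_fraction` is the quantity the abstract's 25.8% figure would correspond to: roughly a quarter of the generated tokens are escalated to the Cloud LLM, with the rest produced locally by the SLM.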