Enhancing On-Device LLM Inference with Historical Cloud-Based LLM Interactions

Published: 01 Jan 2024 · Last Modified: 18 Mar 2025 · KDD 2024 · License: CC BY-SA 4.0
Abstract: Many billion-scale large language models (LLMs) have been released for resource-constrained mobile devices to provide local LLM inference when powerful cloud-based LLMs are unavailable. However, the capabilities of current on-device LLMs still lag behind those of cloud-based LLMs, and effectively and efficiently enhancing on-device LLM inference has become a practical requirement. We therefore propose to collect the user's historical interactions with the cloud-based LLM and build an external datastore on the mobile device, which enhances inference through nearest-neighbor search. However, the full datastore improves token-generation quality at the unacceptable cost of much slower generation. To balance performance and efficiency, we propose selecting an optimal subset of the full datastore within a given size limit, and we prove that the optimization objective is submodular. We further design an offline algorithm, which selects the subset after the full datastore has been constructed, as well as an online algorithm, which performs selection over the stream of interactions and can be scheduled flexibly. We theoretically analyze the performance guarantees and time complexity of both the offline and online designs to demonstrate their effectiveness and scalability. Finally, we evaluate on three ChatGPT-related dialogue datasets with four different on-device LLMs. The results show that the proposed designs significantly improve LLM performance in terms of perplexity while maintaining fast token-generation speed. Practical overhead testing on a smartphone confirms the efficiency of on-device datastore subset selection in terms of memory usage and computation overhead.
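
To make the retrieval-based enhancement described above concrete, the following is a minimal sketch, assuming a kNN-LM-style interpolation in which the on-device LLM's next-token distribution is mixed with a distribution retrieved from a datastore built from historical cloud-LLM interactions. The function and parameter names (e.g., `knn_enhanced_probs`, `lam`) are illustrative, not taken from the paper; a deployed system would use an approximate nearest-neighbor index rather than a brute-force scan.

```python
import numpy as np

def knn_enhanced_probs(context_hidden: np.ndarray,
                       lm_probs: np.ndarray,
                       datastore_keys: np.ndarray,
                       datastore_values: np.ndarray,
                       k: int = 8,
                       lam: float = 0.25,
                       temperature: float = 1.0) -> np.ndarray:
    """Interpolate the LM distribution with a kNN distribution over the vocabulary.

    context_hidden   : (d,) hidden state of the current decoding step.
    lm_probs         : (V,) next-token probabilities from the on-device LLM.
    datastore_keys   : (N, d) hidden states recorded from cloud-LLM interactions.
    datastore_values : (N,) token ids that followed each key.
    """
    # Squared L2 distance to every key (brute force for clarity only).
    dists = np.sum((datastore_keys - context_hidden) ** 2, axis=1)
    nn = np.argsort(dists)[:k]

    # Softmax over negative distances of the k nearest neighbors.
    weights = np.exp(-dists[nn] / temperature)
    weights /= weights.sum()

    # Scatter neighbor weights onto their recorded next tokens.
    knn_probs = np.zeros_like(lm_probs)
    np.add.at(knn_probs, datastore_values[nn], weights)

    # Convex combination of retrieval-based and model-based distributions.
    return lam * knn_probs + (1.0 - lam) * lm_probs
```

The subset-selection step can likewise be sketched as standard greedy maximization of a monotone submodular objective under a size budget, which is the setting in which the classic (1 - 1/e) approximation guarantee applies. The `utility` callable below is a hypothetical placeholder for the paper's objective; this is not the authors' released implementation.

```python
import numpy as np

def greedy_select(keys: np.ndarray, budget: int, utility) -> list[int]:
    """Greedily pick up to `budget` datastore entries maximizing a set utility.

    keys    : (N, d) array of key vectors from the full datastore.
    budget  : maximum number of entries allowed on the device.
    utility : callable mapping a list of indices to a real-valued score;
              if it is monotone submodular, the greedy subset is within
              (1 - 1/e) of the optimal subset of the same size.
    """
    selected: list[int] = []
    remaining = set(range(len(keys)))
    current = utility(selected)
    for _ in range(budget):
        best_idx, best_gain = None, 0.0
        for i in remaining:
            gain = utility(selected + [i]) - current
            if gain > best_gain:
                best_idx, best_gain = i, gain
        if best_idx is None:          # no positive marginal gain remains
            break
        selected.append(best_idx)
        remaining.remove(best_idx)
        current += best_gain
    return selected
```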
