Keywords: cloud-edge collaborative inference, efficient inference, large language model
Abstract: Large language models (LLMs) demonstrate exceptional performance across various natural language processing tasks but face significant computational and memory constraints, making direct deployment on resource-limited edge devices impractical. To address this challenge, we propose CLEAR, a cost-aware cloud-edge collaborative inference framework that efficiently integrates cloud-based LLMs with small language models (SLMs) running on edge devices. CLEAR introduces a cost-aware router that dynamically evaluates SLM-generated outputs and selectively routes low-quality outputs to cloud-based LLMs for refinement, balancing quality and computational efficiency. The framework incorporates two key innovations: a KV cache management system and reinforcement learning-based router training. The KV cache management system prevents cache eviction and minimizes redundant computation by limiting concurrent cloud requests and optimizing retrieval efficiency. The router is trained with reinforcement learning to make adaptive routing decisions that minimize cloud usage while maintaining output quality. Our experimental results demonstrate that CLEAR significantly reduces inference cost and latency while maintaining high response quality, outperforming existing cloud-edge collaborative inference methods: it reduces inference cost by 46% at comparable quality, or improves performance by 15% at similar inference cost. These findings highlight the potential of CLEAR as an efficient and scalable solution for real-time cloud-edge inference applications.
Primary Area: foundation or frontier models, including LLMs
Submission Number: 8066