Optimizing LLM Inference Offloading with Hierarchical Scheduling and Dynamic Sparsification

17 Sept 2025 (modified: 23 Nov 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: Large Language Models, Inference Offloading, Edge Computing
Abstract: Large language models (LLMs) power a new generation of applications, but serving them efficiently at the edge remains a significant challenge due to high computational and memory costs. Current cloud-centric systems largely overlook the vast, cost-effective resources of distributed edge servers. In this paper, we introduce a novel inference offloading framework that distributes LLM workloads across a hybrid edge-cloud architecture to maximize performance and resource utilization. The framework employs a Hierarchical Scheduling Architecture that decouples global, long-term resource planning from real-time, dynamic execution scheduling. At the kernel level, it uses Dynamic Attention Sparsification (DAS) to accelerate GPU computation by pruning redundant attention calculations. Experiments show that our hybrid approach improves overall system throughput by up to 1.86x over a cloud-only baseline, effectively parallelizing workloads and offering a scalable, robust paradigm for distributed LLM serving.
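The abstract does not specify how Dynamic Attention Sparsification prunes redundant attention calculations. The snippet below is a minimal sketch of one common form of attention sparsification, per-query top-k pruning, offered only to illustrate the general idea; the function name `sparse_attention` and the `top_k` parameter are assumptions, not the authors' kernel.

```python
import torch

def sparse_attention(q, k, v, top_k=32):
    """Illustrative per-query top-k attention pruning (not the paper's DAS kernel).

    q, k, v: (batch, heads, seq_len, head_dim) tensors.
    Only the top_k highest-scoring keys per query contribute to the output;
    all other scores are masked to -inf before the softmax.
    """
    d = q.size(-1)
    # Scaled dot-product attention scores: (B, H, Lq, Lk)
    scores = torch.matmul(q, k.transpose(-2, -1)) / d ** 0.5

    top_k = min(top_k, scores.size(-1))
    # Keep the top_k scores along the key dimension for each query; mask the rest.
    _, idx = scores.topk(top_k, dim=-1)
    mask = torch.full_like(scores, float("-inf"))
    mask.scatter_(-1, idx, 0.0)

    probs = torch.softmax(scores + mask, dim=-1)
    return torch.matmul(probs, v)  # (B, H, Lq, head_dim)


if __name__ == "__main__":
    B, H, L, D = 1, 8, 256, 64
    q, k, v = (torch.randn(B, H, L, D) for _ in range(3))
    out = sparse_attention(q, k, v, top_k=32)
    print(out.shape)  # torch.Size([1, 8, 256, 64])
```

In this sketch the degree of sparsification is fixed; a dynamic scheme as described in the abstract would presumably adjust the pruning threshold or budget per layer, head, or request at run time.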
Supplementary Material: zip
Primary Area: foundation or frontier models, including LLMs
Submission Number: 8403