SpecOffload: Unlocking Latent GPU Capacity for LLM Inference on Resource-Constrained Devices

19 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: Large Language Models, Inference Acceleration, Offloading, Speculative Decoding
Abstract: Efficient LLM inference on resource-constrained devices (e.g., PCs with a single commodity GPU) presents significant challenges in compute and memory utilization. Due to limited GPU memory, existing systems offload model weights to CPU memory, incurring substantial I/O overhead between the CPU and GPU. This leads to two major inefficiencies: (1) GPU cores are underutilized, often sitting idle while waiting for weights to be loaded; and (2) GPU memory contributes little to performance, as reducing its allocated capacity has minimal effect on overall throughput. In this paper, we propose SpecOffload, a high-throughput inference engine that embeds speculative decoding into offloading. Our key idea is to unlock these latent GPU resources for storing and executing a draft model used for speculative decoding, thus accelerating inference at near-zero additional cost. To support this, we carefully orchestrate the interleaved execution of the target and draft models in speculative decoding within the offloading pipeline, and propose a planner to manage tensor placement and select optimal parameters. Compared with the best baseline, SpecOffload improves GPU core utilization by 4.49× and boosts inference throughput by 2.54×. Anonymous repo is at https://anonymous.4open.science/r/SpecOffload-F3F2/.
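To make the abstract's key idea concrete, the following is a minimal sketch of a generic speculative-decoding loop of the kind SpecOffload embeds in its offloading pipeline: a cheap draft model proposes a block of tokens, and the expensive target model verifies them in one pass, accepting the longest agreeing prefix plus one corrected token. The toy deterministic "models" below are assumptions for illustration only, not the paper's actual models or API.

```python
# Toy stand-ins (assumptions): each "model" maps a token sequence to the
# next token deterministically, so the accept/reject logic is observable.

def draft_model(tokens):
    # Cheap draft model: guesses the next token as last + 1.
    return tokens[-1] + 1

def target_model(tokens):
    # Expensive target model (ground truth): last + 1, except it emits 0
    # after any nonzero multiple of 4, so some drafts get rejected.
    return 0 if tokens[-1] % 4 == 0 and tokens[-1] != 0 else tokens[-1] + 1

def speculative_decode(prompt, num_tokens, k=3):
    """Generate num_tokens tokens. Each round: the draft proposes k tokens
    autoregressively; the target verifies them; the verified prefix plus
    one target-corrected token is accepted."""
    tokens = list(prompt)
    while len(tokens) < len(prompt) + num_tokens:
        # 1. Draft proposes k tokens (in SpecOffload, this runs on the GPU
        #    while target-model weights stream in from CPU memory).
        proposal = list(tokens)
        for _ in range(k):
            proposal.append(draft_model(proposal))
        drafted = proposal[len(tokens):]
        # 2. Target verifies the drafted block position by position.
        accepted = []
        for tok in drafted:
            expected = target_model(tokens + accepted)
            if tok == expected:
                accepted.append(tok)       # draft token verified
            else:
                accepted.append(expected)  # correction; end this round
                break
        tokens.extend(accepted)
    return tokens[:len(prompt) + num_tokens]
```

When the draft agrees with the target, each round yields up to k tokens for a single target pass, which is what lets idle GPU cycles during weight loading translate into extra throughput.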
Primary Area: infrastructure, software libraries, hardware, systems, etc.
Submission Number: 15033