ShadowSpec: Towards Zero Speculation Overhead for Substitute Speculative Decoding

Published: 01 Jun 2026, Last Modified: 03 Jun 2026, Venue: AdaptFM Poster, Readers: Everyone, License: CC BY 4.0
Keywords: speculative decoding, offloading, consumer GPU
TL;DR: Accelerating LLM inference on consumer GPUs by overlapping draft model computation with target model parameter loading.
Abstract: Deploying large language models (LLMs) on memory-constrained consumer GPUs requires offloading model parameters to CPU memory, where PCIe bandwidth becomes the dominant inference bottleneck. While speculative decoding (SD) reduces the number of target model invocations, its sequential draft-then-verify pipeline leaves the GPU idle during parameter loading, limiting throughput in offloading environments. We identify that draft model computation and target model parameter loading are heterogeneous, non-contending operations that can be overlapped without resource conflict. Building on this observation, we propose two complementary system-level strategies to overlap draft tree extension with target model parameter loading, converting idle GPU cycles into productive draft computation. Experiments on consumer GPU hardware demonstrate a 17–54% end-to-end speedup over the SubSpec baseline.
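The overlap argument in the abstract can be illustrated with a toy per-round latency model. This is a minimal sketch, not the paper's method or measurements: the names `RoundCost`, `sequential_ms`, and `overlapped_ms` are hypothetical, the millisecond values are made up for illustration, and the model assumes that drafting and parameter loading use fully independent resources (GPU compute and PCIe respectively) while verification cannot be hidden.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RoundCost:
    """Hypothetical per-round costs in milliseconds (illustrative only)."""
    draft_ms: float   # GPU time spent building or extending the draft tree
    load_ms: float    # PCIe time spent streaming target parameters to the GPU
    verify_ms: float  # GPU time spent verifying the draft with the target


def sequential_ms(cost: RoundCost) -> float:
    # Draft-then-verify: the GPU sits idle while parameters are loaded,
    # so all three phases add up.
    return cost.draft_ms + cost.load_ms + cost.verify_ms


def overlapped_ms(cost: RoundCost) -> float:
    # Assumption: drafting and loading do not contend, so the round pays
    # only for the longer of the two, plus verification.
    return max(cost.draft_ms, cost.load_ms) + cost.verify_ms


def speedup(cost: RoundCost) -> float:
    return sequential_ms(cost) / overlapped_ms(cost)


if __name__ == "__main__":
    # Made-up numbers for a load-bound round; not taken from the paper.
    cost = RoundCost(draft_ms=30.0, load_ms=100.0, verify_ms=10.0)
    print(f"sequential: {sequential_ms(cost):.1f} ms")
    print(f"overlapped: {overlapped_ms(cost):.1f} ms")
    print(f"speedup:    {speedup(cost):.2f}x")
```

Under this simplified model, the benefit is largest when drafting and loading take similar time and shrinks as one of them dominates. The model ignores tree acceptance rates and any layer-wise pipelining of loading with verification, so it should not be read as a prediction of the reported speedups.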
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 133