Keywords: speculative decoding, offloading, consumer GPU
TL;DR: Accelerating LLM inference on consumer GPUs by overlapping draft model computation with target model parameter loading.
Abstract: Deploying large language models (LLMs) on memory-constrained consumer GPUs requires offloading model parameters to CPU memory, where PCIe bandwidth becomes the dominant inference bottleneck. While speculative decoding (SD) reduces the number of target model invocations, its sequential draft-then-verify pipeline leaves the GPU idle during parameter loading, limiting throughput in offloading environments. We identify that draft model computation and target model parameter loading are heterogeneous, non-contending operations that can be overlapped without resource conflict. Building on this observation, we propose two complementary system-level strategies to overlap draft tree extension with target model parameter loading, converting idle GPU cycles into productive draft computation. Experiments on consumer GPU hardware demonstrate a 17–54% end-to-end speedup over the SubSpec baseline.
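
The intuition behind overlapping can be illustrated with a toy latency model. The sketch below is not the submission's method: it assumes each decoding step consists of one draft phase, one parameter-loading phase, and one verification phase, and that an overlapped pipeline fully hides the shorter of drafting and loading behind the longer. The function names and all timings are hypothetical, chosen only for illustration.

```python
def sequential_step_time(load_s: float, draft_s: float, verify_s: float) -> float:
    """Draft, then load target weights over PCIe, then verify; nothing overlaps."""
    return draft_s + load_s + verify_s


def overlapped_step_time(load_s: float, draft_s: float, verify_s: float) -> float:
    """Drafting runs on the GPU while weights stream in, so the longer phase dominates."""
    return max(draft_s, load_s) + verify_s


# Hypothetical timings in seconds (not measurements from the paper).
load_s, draft_s, verify_s = 0.5, 0.25, 0.25
seq = sequential_step_time(load_s, draft_s, verify_s)
ovl = overlapped_step_time(load_s, draft_s, verify_s)
print(f"sequential: {seq:.2f}s, overlapped: {ovl:.2f}s, ratio: {seq / ovl:.2f}x")
```

This model captures only latency hiding. In practice, loading and verification are typically interleaved layer by layer, and extra drafting during loading may also change how many tokens are accepted per step, neither of which is modeled here.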
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 133