SLiM: Speculative Decoding with Hypothesis Reduction

Anonymous

16 Dec 2023, ACL ARR 2023 December Blind Submission
TL;DR: We propose a speculative decoding enhancement with a lightweight verifier that reduces computation.
Abstract: Speculative decoding has emerged as a prominent alternative to autoregressive decoding for expediting inference in large language models (LLMs). However, prevailing approaches often focus solely on latency reduction, neglecting computational expense. In this paper, we present \textbf{S}peculate \textbf{L}ess, val\textbf{i}date \textbf{M}ore (SLiM), a speculative decoding enhancement that reduces the speculation set while validating more effective tokens. SLiM is designed to mitigate the LLM computation costs associated with token verification by introducing hypothesis reduction based on a fast posterior estimation. It consistently surpasses counterparts lacking cost reduction across a spectrum of hardware, from CPU to GPU. Our evaluation on diverse conversational datasets shows that SLiM achieves a substantial $70\%$ reduction in FLOPs while generating more effective predictions than prior art.
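The abstract describes a three-stage pipeline: draft a set of candidate tokens, prune the hypothesis set with a fast posterior estimate, and verify only the survivors with the expensive target model. Below is a minimal toy sketch of that control flow. All distributions, function names, and the deterministic accept rule are hypothetical stand-ins; the paper's actual posterior estimator and verification procedure are not specified on this page.

```python
import random

VOCAB = ["the", "cat", "sat", "on", "mat"]


def _toy_dist(prefix, salt):
    # Deterministic toy distribution over VOCAB for a given prefix.
    rng = random.Random((tuple(prefix), salt))
    w = [rng.random() for _ in VOCAB]
    s = sum(w)
    return {t: x / s for t, x in zip(VOCAB, w)}


def draft_probs(prefix):
    # Stand-in for the cheap draft model's next-token distribution.
    return _toy_dist(prefix, salt=0)


def target_probs(prefix):
    # Stand-in for the large target model (expensive in practice).
    return _toy_dist(prefix, salt=1)


def cheap_posterior_estimate(prefix, token, p_draft):
    # Hypothetical fast proxy for the target posterior (e.g. a small
    # side scorer); here we simply reuse the draft probability.
    return p_draft[token]


def speculate_and_prune(prefix, k=4, keep=2):
    """Draft k candidates, keep the `keep` most promising under the
    cheap posterior estimate (hypothesis reduction), then verify the
    survivors against the target model's distribution."""
    p_d = draft_probs(prefix)
    drafted = sorted(p_d, key=p_d.get, reverse=True)[:k]
    ranked = sorted(
        drafted,
        key=lambda t: cheap_posterior_estimate(prefix, t, p_d),
        reverse=True,
    )
    survivors = ranked[:keep]  # fewer hypotheses reach the target model
    p_t = target_probs(prefix)  # one target-model call verifies survivors
    # Simplified deterministic accept rule; standard speculative decoding
    # instead accepts each token with probability min(1, p_t / p_d).
    accepted = [t for t in survivors if p_t[t] >= p_d[t]]
    return survivors, accepted
```

The point of the sketch is the `keep < k` step: verification cost in speculative decoding scales with the number of hypotheses the target model must score, so pruning with a cheap estimate before verification is where the claimed FLOPs savings would come from.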
Paper Type: long
Research Area: Generation
Contribution Types: Approaches low compute settings-efficiency
Languages Studied: English