SLiM: Speculative Decoding with Hypothesis Reduction

Anonymous

16 Dec 2023, ACL ARR 2023 December Blind Submission
TL;DR: We propose a speculative decoding enhancement with a lightweight verifier that reduces computation.
Abstract: Speculative decoding has emerged as a prominent alternative to autoregressive decoding for expediting inference in large language models (LLMs). However, prevailing approaches often focus solely on latency reduction, neglecting computational expense. In this paper, we present \textbf{S}peculate \textbf{L}ess, val\textbf{i}date \textbf{M}ore (SLiM), a speculative decoding enhancement that reduces the speculation set while validating more effective tokens. SLiM is designed to mitigate the LLM computation costs associated with token verification by introducing hypothesis reduction based on a fast posterior estimation. It consistently surpasses counterparts lacking cost reduction across a spectrum of hardware, from CPU to GPU. Our evaluation on diverse conversational datasets shows that SLiM achieves a substantial $70\%$ reduction in FLOPs while generating more effective predictions than prior art.
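The abstract describes a three-stage pipeline: draft a set of candidate tokens, prune the hypothesis set with a fast posterior estimate, and verify only the survivors with the expensive target model. Below is a minimal toy sketch of that control flow. All distributions, function names, and the deterministic accept rule are hypothetical stand-ins; the paper's actual posterior estimator and verification procedure are not specified on this page.

```python
import random

VOCAB = ["the", "cat", "sat", "on", "mat"]


def _toy_dist(prefix, salt):
    # Deterministic toy distribution over VOCAB for a given prefix.
    rng = random.Random((tuple(prefix), salt))
    w = [rng.random() for _ in VOCAB]
    s = sum(w)
    return {t: x / s for t, x in zip(VOCAB, w)}


def draft_probs(prefix):
    # Stand-in for the cheap draft model's next-token distribution.
    return _toy_dist(prefix, salt=0)


def target_probs(prefix):
    # Stand-in for the large target model (expensive in practice).
    return _toy_dist(prefix, salt=1)


def cheap_posterior_estimate(prefix, token, p_draft):
    # Hypothetical fast proxy for the target posterior (e.g. a small
    # side scorer); here we simply reuse the draft probability.
    return p_draft[token]


def speculate_and_prune(prefix, k=4, keep=2):
    """Draft k candidates, keep the `keep` most promising under the
    cheap posterior estimate (hypothesis reduction), then verify the
    survivors against the target model's distribution."""
    p_d = draft_probs(prefix)
    drafted = sorted(p_d, key=p_d.get, reverse=True)[:k]
    ranked = sorted(
        drafted,
        key=lambda t: cheap_posterior_estimate(prefix, t, p_d),
        reverse=True,
    )
    survivors = ranked[:keep]  # fewer hypotheses reach the target model
    p_t = target_probs(prefix)  # one target-model call verifies survivors
    # Simplified deterministic accept rule; standard speculative decoding
    # instead accepts each token with probability min(1, p_t / p_d).
    accepted = [t for t in survivors if p_t[t] >= p_d[t]]
    return survivors, accepted
```

The point of the sketch is the `keep < k` step: verification cost in speculative decoding scales with the number of hypotheses the target model must score, so pruning with a cheap estimate before verification is where the claimed FLOPs savings would come from.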
Paper Type: long
Research Area: Generation
Contribution Types: Approaches low compute settings-efficiency
Languages Studied: English