TETRIS: Optimal Draft Token Selection for Batch Speculative Decoding

ACL ARR 2025 February Submission 4378 Authors

15 Feb 2025 (modified: 09 May 2025) · ACL ARR 2025 February Submission · CC BY 4.0
Abstract:

We propose Tetris, a novel method that optimizes the $\textit{total throughput}$ of batch speculative decoding in multi-request settings. Unlike existing methods that optimize for a single request or for a group of requests as a whole, Tetris actively selects the most promising draft tokens (for every request in a batch) to be accepted when verified in parallel, resulting in fewer rejected tokens and hence less wasted computing resources. Such effective resource utilization to achieve fast inference in large language models (LLMs) is especially important to service providers with limited inference capacity. Compared to baseline speculative decoding, Tetris yields a consistently higher acceptance rate and more effective utilization of the limited inference capacity. We show theoretically and empirically that Tetris outperforms baseline speculative decoding and existing methods that dynamically select draft tokens, leading to more efficient batch inference in LLMs.
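To make the idea of batch-wide draft-token selection concrete, here is a minimal sketch (not the authors' implementation; the function name, scoring scheme, and budget model are illustrative assumptions). It greedily picks draft tokens across all requests in a batch by draft-model confidence, under a shared verification-capacity budget, while respecting prefix order within each request: a later draft token is only eligible once all earlier draft tokens of its request have been selected.

```python
# Hypothetical sketch of cross-request draft-token selection for batch
# speculative decoding. Assumed inputs: per-request lists of draft-token
# confidence scores (in draft order) and a total verification budget.
import heapq

def select_draft_tokens(confidences, budget):
    """Return the number of draft tokens selected for each request.

    confidences: list (one entry per request) of lists of draft-token
        confidence scores, in the order the draft model produced them.
    budget: total number of tokens the target model can verify in one pass.
    """
    selected = [0] * len(confidences)
    # Max-heap keyed on each request's next eligible token's confidence
    # (negated, since heapq is a min-heap).
    heap = [(-scores[0], r) for r, scores in enumerate(confidences) if scores]
    heapq.heapify(heap)
    while heap and budget > 0:
        _, r = heapq.heappop(heap)
        selected[r] += 1          # accept this request's next draft token
        budget -= 1
        nxt = selected[r]
        if nxt < len(confidences[r]):
            # The following token of request r becomes eligible.
            heapq.heappush(heap, (-confidences[r][nxt], r))
    return selected
```

A natural refinement, consistent with maximizing expected accepted tokens, is to score each token by the product of confidences along its prefix rather than its own confidence alone, since a token only contributes if every earlier token in its request is accepted.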

Paper Type: Long
Research Area: Efficient/Low-Resource Methods for NLP
Research Area Keywords: Speculative Decoding, Resource-constrained Settings, Large Language Models
Contribution Types: NLP engineering experiment, Approaches to low-resource settings
Languages Studied: English
Submission Number: 4378