Fast Large Language Model Collaborative Decoding via Speculation

Published: 01 May 2025, Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY 4.0
TL;DR: We propose a new framework that can accelerate any form of LLM collaborative decoding—including model ensembling, contrastive decoding, and decoding-time realignment—without compromising performance.
Abstract: Large Language Model (LLM) collaborative decoding techniques improve output quality by combining the outputs of multiple models at each generation step, but they incur high computational costs. In this paper, we introduce **Collaborative decoding via Speculation (CoS)**, a novel framework that accelerates collaborative decoding without compromising performance. Inspired by Speculative Decoding, in which a small proposal model generates tokens sequentially and a larger target model verifies them in parallel, our approach builds on two key insights: (1) the verification distribution can be the combined distribution of both the proposal and target models, and (2) alternating each model between the proposer and verifier roles can further enhance efficiency. We generalize this method to collaboration among *n* models and theoretically prove that CoS is never slower than standard collaborative decoding and is typically faster. Extensive experiments demonstrate that CoS is **1.11x–2.23x** faster than standard collaborative decoding without compromising generation quality. Our code is available at https://github.com/Kamichanw/CoS/.
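To illustrate the first key insight from the abstract, below is a minimal NumPy sketch of speculative verification where the acceptance test uses a *combined* distribution of the proposal and target models (here a simple convex mixture, standing in for any collaborative decoding rule) instead of the target distribution alone. The function names (`combine`, `speculative_verify`), the mixture weight, and the toy distributions are illustrative assumptions, not the authors' implementation; see the linked repository for the actual CoS code.

```python
# Sketch: speculative verification against a combined (ensemble) distribution.
import numpy as np

rng = np.random.default_rng(0)

def combine(q, p, lam=0.5):
    """Toy collaborative rule: convex mixture of proposal (q) and target (p)."""
    m = lam * q + (1.0 - lam) * p
    return m / m.sum()

def speculative_verify(draft_tokens, q_dists, p_dists, lam=0.5):
    """Accept or reject drafted tokens against the combined distribution.

    draft_tokens: tokens sampled sequentially from the proposal model.
    q_dists, p_dists: per-position next-token distributions of the proposal
    and target models, each of shape [len(draft_tokens), vocab_size].
    Returns the accepted prefix, plus one corrective token on rejection,
    as in standard speculative sampling.
    """
    accepted = []
    for x, q, p in zip(draft_tokens, q_dists, p_dists):
        m = combine(q, p, lam)                     # verification distribution
        if rng.random() < min(1.0, m[x] / q[x]):
            accepted.append(x)                     # token follows m exactly
        else:
            residual = np.maximum(m - q, 0.0)      # resample from the residual
            residual /= residual.sum()
            accepted.append(int(rng.choice(len(m), p=residual)))
            break                                  # discard the rest of the draft
    return accepted

# Tiny usage example: vocabulary of 5 tokens, 3 drafted tokens.
vocab, k = 5, 3
q_dists = rng.dirichlet(np.ones(vocab), size=k)
p_dists = rng.dirichlet(np.ones(vocab), size=k)
draft = [int(rng.choice(vocab, p=q)) for q in q_dists]
print(speculative_verify(draft, q_dists, p_dists))
```

This mirrors standard speculative sampling, so accepted tokens are distributed according to the combined distribution; swapping which model drafts and which verifies (the second insight) would simply exchange the roles of `q_dists` and `p_dists` between rounds.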
Lay Summary: Large language models (LLMs), like ChatGPT, generate responses by predicting one word (or *token*) at a time based on the input. A natural idea is that instead of using only one LLM to guess the next token, we can combine the guesses of several LLMs to get more accurate and reliable results. We refer to this class of methods as *LLM collaborative decoding*. However, because it must run several models for every token, generation with *n* models becomes roughly *n* times slower, which makes it hard to use in real situations. To fix this problem, we propose a new framework: **Collaborative Decoding via Speculation (CoS)**. CoS can speed up any type of collaborative decoding, such as model ensembling, contrastive decoding, or decoding-time realignment, while keeping the same high-quality output. Moreover, CoS requires no training, added parameters, or extra computation, so it can directly replace existing LLM collaborative decoding methods. Because of this, CoS has strong potential and value for real-world use.
Link To Code: https://github.com/Kamichanw/CoS/
Primary Area: Deep Learning->Large Language Models
Keywords: LLM Collaborative Decoding, Inference Acceleration, Speculative Decoding
Submission Number: 5862