In-batch Ensemble Drafting: Toward Fast and Robust Speculative Decoding for Multimodal Language Models

Minjae Lee; Wonjun Kang; Minghao Yan; Christian Classen; Hyung Il Koo; Kangwook Lee

17 Sept 2024 (modified: 14 Nov 2024), ICLR 2025 Conference Withdrawn Submission, CC BY 4.0

Keywords: Speculative decoding, Large language model, Vision language model, Inference Acceleration

TL;DR: We systematically explore speculative decoding for MLLMs and propose a new drafting method, In-batch Ensemble Drafting.

Abstract: Multimodal Large Language Models (MLLMs) have emerged as powerful tools for processing modalities beyond text by combining a visual encoder with Large Language Models (LLMs) to incorporate visual context. This integration, however, leads to higher computational costs during LLM inference, specifically in the prefill and decoding stages. Existing MLLM acceleration methods primarily focus on reducing the cost of long prefills caused by visual context, but this approach has two limitations: (1) from a latency perspective, it mainly benefits the prefill stage and offers minimal improvement for decoding; (2) it does not guarantee output distributions identical to those of the original MLLM. To ensure an identical output distribution while reducing decoding latency, we focus on speculative decoding (SD), an acceleration technique in which a smaller draft model proposes tokens that a larger target model then verifies. Despite its importance for LLM acceleration, SD's application to MLLMs remains largely unexplored, even though decoding constitutes a significant portion of MLLM inference latency. We investigate several drafting techniques for multimodal scenarios (multimodal, text-only, image-pooling, and caption-based) and analyze their integration with MLLMs. Building on these insights, we propose In-batch Ensemble Drafting, which combines probability distributions from multiple drafting methods via batch inference during the SD draft phase. This approach requires no additional model parameters, incurs minimal overhead, and significantly increases the likelihood of draft tokens passing verification, thereby enhancing performance and robustness across diverse input scenarios.
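
To make the idea of combining probability distributions from several drafting methods concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not say how the distributions are combined, so the uniform weighted average, the function names, and the toy logits are all hypothetical. The acceptance function is the standard speculative-sampling rule (accept a drafted token with probability min(1, p/q)), which the abstract does not spell out but which is the usual mechanism behind the identical-output-distribution guarantee.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def ensemble_draft_distribution(batch_logits, weights=None):
    """Mix per-method next-token distributions into one draft distribution.

    batch_logits: one row of logits per drafting method, as a single
        batched forward pass of the draft model might return them.
    weights: hypothetical mixing weights; uniform if omitted (an assumption,
        since the abstract does not specify the combination rule).
    """
    n = len(batch_logits)
    if weights is None:
        weights = [1.0 / n] * n
    probs = [softmax(row) for row in batch_logits]
    vocab = len(probs[0])
    return [sum(w * p[v] for w, p in zip(weights, probs)) for v in range(vocab)]


def acceptance_probability(target_prob, draft_prob):
    """Standard speculative-sampling rule: accept with probability min(1, p/q)."""
    return min(1.0, target_prob / draft_prob)


# Toy example: four drafting methods (multimodal, text-only, image-pooling,
# caption-based) scored over a three-token vocabulary. Values are made up.
toy_logits = [
    [2.0, 0.5, 0.1],  # multimodal
    [1.0, 1.5, 0.2],  # text-only
    [1.8, 0.4, 0.3],  # image-pooling
    [1.2, 1.1, 0.1],  # caption-based
]
q = ensemble_draft_distribution(toy_logits)
draft_token = max(range(len(q)), key=lambda v: q[v])
```

Because the four inputs share one draft model and differ only in how the visual context is presented, they can be stacked as rows of a single batch, which is consistent with the abstract's claim that the method needs no additional parameters.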

Supplementary Material: zip

Primary Area: generative models

Submission Number: 1306
