Track: Main paper track (up to 5 pages excluding references and appendix)
Keywords: Speculative decoding, Multimodal Large Language Models, Large Vision Language Models
TL;DR: We systematically explore speculative decoding for LVLMs and propose a new method, in-batch ensemble drafting.
Abstract: Despite the success of Speculative Decoding (SD) in LLM inference acceleration, it remains largely unexplored for Large Vision Language Models (LVLMs), an advanced class of LLMs that can handle multimodal prompts consisting of text and image tokens. To bridge this gap, we first conduct a comprehensive benchmarking study focusing on the effectiveness of various drafting methods. We observe that each drafting method has its own advantages, and none consistently outperforms the others. Motivated by this observation, we propose **In-batch Ensemble Drafting (IbED)**, a simple yet effective SD method for LVLMs. IbED leverages multiple drafting methods via batch inference without incurring much additional latency and, compared to multimodal drafting, consistently demonstrates significant improvements in block efficiency, averaging 6% (with a maximum of 23%) across a wide range of datasets.
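A minimal sketch of the idea described above, for illustration only: several drafters each propose a block of tokens, the target model verifies all proposals in a single batched forward pass, and the proposal with the longest accepted prefix is kept. The function names (`greedy_draft`, `in_batch_ensemble_step`), greedy verification, and toy stand-in models are assumptions for the sketch, not the paper's implementation.

```python
# Illustrative sketch of in-batch ensemble drafting (IbED); toy models, greedy verification.
from typing import Callable, List
import torch

def greedy_draft(draft_logits_fn: Callable[[torch.Tensor], torch.Tensor],
                 prefix: torch.Tensor, gamma: int) -> torch.Tensor:
    """Autoregressively draft `gamma` tokens with one (cheap) drafter."""
    seq = prefix.clone()
    for _ in range(gamma):
        next_tok = draft_logits_fn(seq)[-1].argmax(dim=-1, keepdim=True)
        seq = torch.cat([seq, next_tok])
    return seq[prefix.numel():]  # the gamma drafted tokens

def in_batch_ensemble_step(target_logits_fn: Callable[[torch.Tensor], torch.Tensor],
                           drafters: List[Callable[[torch.Tensor], torch.Tensor]],
                           prefix: torch.Tensor, gamma: int) -> torch.Tensor:
    """One SD step: every drafter proposes gamma tokens, the target verifies all
    proposals in one batched forward pass, and the best proposal is kept."""
    K, L = len(drafters), prefix.numel()
    drafts = torch.stack([greedy_draft(d, prefix, gamma) for d in drafters])   # (K, gamma)
    batch = torch.cat([prefix.expand(K, -1), drafts], dim=1)                   # (K, L+gamma)
    logits = target_logits_fn(batch)                                           # (K, L+gamma, V)
    preds = logits[:, L - 1:-1].argmax(dim=-1)                                 # target's choice per drafted position
    accepted = (preds == drafts).long().cumprod(dim=1).sum(dim=1)              # accepted prefix length per drafter
    best = int(accepted.argmax())
    n = int(accepted[best])
    bonus = logits[best, L - 1 + n].argmax(dim=-1, keepdim=True)               # one extra token from the target
    return torch.cat([drafts[best, :n], bonus])

# Toy usage with random stand-in "models" (purely illustrative):
V = 100
torch.manual_seed(0)
target = lambda x: torch.randn(*x.shape, V)
drafters = [lambda x: torch.randn(*x.shape, V) for _ in range(3)]
prefix = torch.randint(0, V, (8,))
print(in_batch_ensemble_step(target, drafters, prefix, gamma=4))
```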
Anonymization: This submission has been anonymized for double-blind review via the removal of identifying information such as names, affiliations, and identifying URLs.
Presenter: ~Minjae_Lee2
Format: Maybe: the presenting author will attend in person, contingent on other factors that still need to be determined (e.g., visa, funding).
Funding: No, the presenting author of this submission does *not* fall under ICLR’s funding aims, or has sufficient alternate funding.
Submission Number: 24