Abstract: Speculative decoding (SD) has proven effective in accelerating LLM inference by quickly generating draft tokens and verifying them in parallel. However, SD remains largely unexplored for Large Vision Language Models (LVLMs), LLMs extended to process both image and text prompts. To address this gap, we first benchmark existing drafting methods for LVLMs across diverse scenarios and observe that methods relying on small draft models exhibit scenario-specific performance fluctuations. Motivated by these findings, we propose Test-time Adaptive Batched Ensemble Drafting (TABED), which dynamically ensembles multiple drafts obtained via batch inference, leveraging the measurable deviations from past ground truths that are available in the SD setting. Across diverse input scenarios, TABED achieves a robust average expected wall-time speedup of 1.74x over standard decoding and a 5% improvement over individual drafting methods, while remaining training-free and keeping ensembling costs negligible through parameter sharing. To further enhance extensibility, we also explore alternative drafting methods based on image pooling and captioning. Our method is seamlessly compatible with existing LVLM acceleration techniques, and we open-source custom-trained draft LVLMs to ensure reproducibility.
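To make the test-time adaptive ensembling idea concrete, below is a minimal Python sketch of one adaptive drafting step. All names here (`Drafter`, `tabed_step`, the sliding-window acceptance score) are hypothetical illustrations, not the paper's actual algorithm; TABED's weighting rule and batched implementation may differ.

```python
# Illustrative sketch of test-time adaptive ensemble drafting (hypothetical
# names and weighting rule; not the paper's exact TABED algorithm).
from collections import deque
from typing import Callable, List, Sequence


class Drafter:
    """Wraps one drafting method and tracks how well its recent drafts
    matched the target model's verified (ground-truth) tokens."""

    def __init__(self, draft_fn: Callable[[Sequence[int], int], List[int]],
                 window: int = 50):
        self.draft_fn = draft_fn
        self.hits: deque = deque(maxlen=window)  # 1 = drafted token accepted

    def score(self) -> float:
        # Empirical acceptance rate over a sliding window (uniform prior if empty).
        return sum(self.hits) / len(self.hits) if self.hits else 0.5

    def update(self, draft: List[int], verified: List[int]) -> None:
        # Measurable deviation from past ground truth: token-level agreement
        # between the proposed draft and the tokens the target model accepted.
        for d, v in zip(draft, verified):
            self.hits.append(int(d == v))


def tabed_step(prefix: List[int], drafters: List[Drafter], k: int,
               verify: Callable[[List[int], List[int]], List[int]]) -> List[int]:
    """One speculative-decoding step: draft k tokens with every method
    (a single batched forward pass over shared parameters in practice),
    use the draft from the currently best-scoring method, verify it with
    the target model, and update all drafters against the verified tokens."""
    drafts = [d.draft_fn(prefix, k) for d in drafters]  # batched in practice
    best = max(range(len(drafters)), key=lambda i: drafters[i].score())
    verified = verify(prefix, drafts[best])  # target-model verification
    for d, draft in zip(drafters, drafts):
        d.update(draft, verified)
    return verified
```

Because all scoring signals come from tokens the target model has already verified, this style of adaptation adds no training cost; only the (assumed) sliding-window bookkeeping runs at test time.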
Paper Type: Long
Research Area: Generation
Research Area Keywords: inference methods
Contribution Types: Approaches low compute settings-efficiency
Languages Studied: English
Submission Number: 3890