Abstract: Speculative decoding (SD) has proven effective in accelerating LLM inference by quickly generating draft tokens and verifying them in parallel. However, SD remains largely unexplored for Large Vision Language Models (LVLMs), LLMs extended to process both image and text prompts. To address this gap, we first benchmark existing drafting methods for LVLMs across diverse scenarios and observe that methods relying on small draft models exhibit scenario-specific performance fluctuations. Motivated by these findings, we propose Test-time Adaptive Batched Ensemble Drafting (TABED), which dynamically ensembles multiple drafts obtained via batch inference, leveraging the measurable deviations from past ground truths that are available in the SD setting. Across diverse input scenarios, TABED achieves a robust average expected wall-time speedup of 1.74x over standard decoding and a 5% improvement over individual drafting methods, while remaining training-free and keeping ensembling costs negligible through parameter sharing. To further enhance extensibility, we also explore alternative drafting methods based on image pooling and captioning. Our method is seamlessly compatible with existing LVLM acceleration techniques, and we open-source custom-trained draft LVLMs to ensure reproducibility.
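To make the test-time adaptive ensembling idea concrete, below is a minimal Python sketch of one adaptive drafting step. All names here (`Drafter`, `tabed_step`, the sliding-window acceptance score) are hypothetical illustrations, not the paper's actual algorithm; TABED's weighting rule and batched implementation may differ.

```python
# Illustrative sketch of test-time adaptive ensemble drafting (hypothetical
# names and weighting rule; not the paper's exact TABED algorithm).
from collections import deque
from typing import Callable, List, Sequence


class Drafter:
    """Wraps one drafting method and tracks how well its recent drafts
    matched the target model's verified (ground-truth) tokens."""

    def __init__(self, draft_fn: Callable[[Sequence[int], int], List[int]],
                 window: int = 50):
        self.draft_fn = draft_fn
        self.hits: deque = deque(maxlen=window)  # 1 = drafted token accepted

    def score(self) -> float:
        # Empirical acceptance rate over a sliding window (uniform prior if empty).
        return sum(self.hits) / len(self.hits) if self.hits else 0.5

    def update(self, draft: List[int], verified: List[int]) -> None:
        # Measurable deviation from past ground truth: token-level agreement
        # between the proposed draft and the tokens the target model accepted.
        for d, v in zip(draft, verified):
            self.hits.append(int(d == v))


def tabed_step(prefix: List[int], drafters: List[Drafter], k: int,
               verify: Callable[[List[int], List[int]], List[int]]) -> List[int]:
    """One speculative-decoding step: draft k tokens with every method
    (a single batched forward pass over shared parameters in practice),
    use the draft from the currently best-scoring method, verify it with
    the target model, and update all drafters against the verified tokens."""
    drafts = [d.draft_fn(prefix, k) for d in drafters]  # batched in practice
    best = max(range(len(drafters)), key=lambda i: drafters[i].score())
    verified = verify(prefix, drafts[best])  # target-model verification
    for d, draft in zip(drafters, drafts):
        d.update(draft, verified)
    return verified
```

Because all scoring signals come from tokens the target model has already verified, this style of adaptation adds no training cost; only the (assumed) sliding-window bookkeeping runs at test time.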
Paper Type: Long
Research Area: Generation
Research Area Keywords: inference methods
Contribution Types: Approaches low compute settings-efficiency
Languages Studied: English
Submission Number: 3890