TABED: Test-Time Adaptive Ensemble Drafting for Robust Speculative Decoding in LVLMs

ACL ARR 2025 February Submission778 Authors

11 Feb 2025 (modified: 09 May 2025) · License: CC BY 4.0
Abstract: Speculative decoding (SD) accelerates the decoding stage by speculating multiple next tokens with a small draft model, which are, in turn, verified by the target model in parallel. Despite its success in LLM inference acceleration, SD remains largely unexplored for Large Vision Language Models (LVLMs), an advanced class of LLMs that handle multimodal prompts consisting of text and image tokens. To bridge this gap, we first evaluate a comprehensive set of scenarios reflecting real-world deployments of LVLM SD. We observe that drafting with and without image tokens using a small draft model exhibits scenario-specific performance fluctuations. Motivated by this, we propose Test-time Adaptive Batched Ensemble Drafting (TABED), a fully training-free yet effective SD method for LVLMs. Our method leverages multiple drafting methods via batch inference and dynamically weights the resulting drafts based on their deviation from the target model’s previous outputs. To further enhance its extensibility at negligible cost, we incorporate alternative drafting strategies, such as image captioning and pooling. Our method achieves an average speedup of 1.8x while maintaining robustness across diverse input scenarios. Since our method relies solely on the draft model without incurring additional costs, it is fully compatible with existing LVLM acceleration techniques and can be seamlessly integrated with them. To ensure reproducibility, we open-source our code and custom-trained draft LVLMs.
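The abstract's core idea of weighting multiple drafts by their deviation from the target model's previous outputs can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: the function names (`update_weights`, `ensemble_draft`) and the use of recent acceptance rates as the deviation signal are assumptions for illustration.

```python
# Hedged sketch (not the paper's code): weight each drafting strategy by how
# well it has recently agreed with the target model, then combine their
# next-token distributions. All names here are illustrative assumptions.

def update_weights(accept_rates):
    """Weight each drafter by its recent acceptance rate against the target
    model (lower deviation -> higher weight), normalized to sum to 1."""
    total = sum(accept_rates)
    if total == 0:
        n = len(accept_rates)
        return [1.0 / n] * n  # fall back to a uniform ensemble
    return [r / total for r in accept_rates]

def ensemble_draft(draft_probs, weights):
    """Combine per-drafter next-token distributions into one proposal via a
    weighted average (one possible ensembling choice)."""
    vocab = len(draft_probs[0])
    return [sum(w * p[i] for w, p in zip(weights, draft_probs))
            for i in range(vocab)]

# Two drafters over a toy 3-token vocabulary: e.g. one drafting with image
# tokens, one text-only (illustrative numbers).
weights = update_weights([0.8, 0.4])  # drafter 0 agreed with the target more
mixed = ensemble_draft([[0.7, 0.2, 0.1],
                        [0.1, 0.6, 0.3]], weights)
```

Because the weights come only from the draft–target agreement already observed during verification, this kind of adaptation adds no extra forward passes, consistent with the abstract's training-free, no-additional-cost claim.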
Paper Type: Long
Research Area: Generation
Research Area Keywords: inference methods
Contribution Types: Approaches for low compute settings-efficiency
Languages Studied: English
Submission Number: 778