MASSV: Multimodal Adaptation and Self-Data Distillation for Speculative Decoding of Vision-Language Models
Abstract: Speculative decoding significantly accelerates language model inference by
enabling a lightweight draft model to propose multiple tokens that a larger
target model verifies simultaneously. However, applying this technique to
vision-language models (VLMs) presents two fundamental challenges: small
language models that could serve as efficient drafters lack the architectural
components to process visual inputs, and their token predictions fail to match
those of VLM target models that consider visual context. We introduce
Multimodal Adaptation and Self-Data Distillation for
Speculative Decoding of Vision-Language Models (MASSV), which
transforms existing small language models into effective multimodal drafters
through a two-phase approach. MASSV first connects the target VLM's vision
encoder to the draft model via a lightweight trainable projector, then applies
self-distilled visual instruction tuning using responses generated by the target
VLM to align token predictions. Comprehensive experiments across the Qwen2.5-VL
and Gemma3 model families demonstrate that MASSV increases accepted length by up
to 30% and delivers end-to-end inference speedups of up to 1.46x
compared to conventional text-only drafting baselines on visually grounded
tasks.
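The first phase described above, connecting the target VLM's vision encoder to the draft model through a lightweight trainable projector, can be pictured with a minimal sketch like the one below. The module names, dimensions, and the two-layer MLP design are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn as nn


class VisionProjector(nn.Module):
    """Hypothetical lightweight projector: maps frozen vision-encoder features
    into the draft language model's embedding space so the drafter can
    condition on the same visual context as the target VLM."""

    def __init__(self, vision_dim: int, draft_dim: int, hidden_dim: int = 2048):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, draft_dim),
        )

    def forward(self, vision_features: torch.Tensor) -> torch.Tensor:
        # vision_features: (batch, num_patches, vision_dim) taken from the
        # target VLM's vision encoder; output lies in the draft embed space.
        return self.proj(vision_features)


def build_draft_inputs(projector: VisionProjector,
                       vision_features: torch.Tensor,
                       text_embeddings: torch.Tensor) -> torch.Tensor:
    """Prepend projected image tokens to the draft model's text embeddings,
    mirroring how the target VLM consumes visual context (illustrative only)."""
    image_tokens = projector(vision_features)
    return torch.cat([image_tokens, text_embeddings], dim=1)
```

In this sketch only the projector would be trained in the first phase; the second phase (self-distilled visual instruction tuning on target-generated responses) would then fine-tune the drafter on these combined inputs.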
Paper Type: Long
Research Area: Efficient/Low-Resource Methods for NLP
Research Area Keywords: speculative decoding, llm, multimodal, vision, inference
Contribution Types: Approaches to low-resource settings, Approaches for low compute settings-efficiency
Languages Studied: English
Keywords: speculative decoding, llm, multimodal, vision, inference
Submission Number: 4975