Abstract: Multimodal generative models have demonstrated remarkable capabilities across diverse domains, from visual understanding and image generation to video processing, audio synthesis, and embodied control. These capabilities, however, come at the cost of substantial inference overhead from autoregressive decoding or iterative generation, an overhead further compounded by modality-specific challenges such as extensive visual token redundancy, strict real-time latency constraints in robotic control, and prolonged sequential generation in text-to-image synthesis. Speculative decoding has emerged as a promising paradigm for accelerating inference without degrading output quality, yet existing surveys remain focused on text-only large language models. In this survey, we provide a systematic and comprehensive review of speculative decoding methods for multimodal models, spanning Vision–Language, Vision–Language–Action, Video–Language, Speech, Text-to-Image (Vision Auto-Regressive), and Diffusion models. We organize the literature into a unified taxonomy with two primary axes, the draft generation stage and the verification and acceptance stage, complemented by an analysis of inference framework support. Through this taxonomy, we identify recurring cross-modal design patterns, including token compression, KV cache optimization, target-informed transfer, drafter–target alignment, verification cost reduction, relaxed acceptance, and verify-to-draft feedback, and examine how successful techniques transfer across modalities. We further compare existing methods under both self-reported and standardized benchmarking settings. Finally, we discuss open challenges and outline future directions. We hope this survey serves as a valuable resource for researchers and practitioners working on accelerating multimodal inference.
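For orientation, the draft-then-verify loop that underlies all of the surveyed methods can be stated in a few lines. The sketch below implements the standard lossless acceptance rule of speculative sampling (accept a drafted token with probability min(1, p/q), otherwise resample from the normalized residual) on toy distributions; the function name `speculative_step` and the toy setup are illustrative assumptions for this summary, not code from the survey itself.

```python
import numpy as np

def speculative_step(target_probs, draft_probs, draft_tokens, rng):
    """One draft-then-verify step of speculative sampling.

    target_probs: (k+1, V) target-model distributions at each drafted position
    draft_probs:  (k, V)   draft-model distributions used to sample draft_tokens
    draft_tokens: (k,)     tokens proposed by the cheap draft model
    Returns the accepted prefix plus one token sampled from the target
    (or its residual), so each step yields at least one new token.
    """
    accepted = []
    for i, tok in enumerate(draft_tokens):
        p, q = target_probs[i, tok], draft_probs[i, tok]
        # Accept with probability min(1, p/q); this keeps the output
        # distribution identical to sampling from the target alone.
        if rng.random() < min(1.0, p / q):
            accepted.append(int(tok))
        else:
            # Rejected: resample from the residual max(0, p - q), renormalized.
            residual = np.maximum(target_probs[i] - draft_probs[i], 0.0)
            residual /= residual.sum()
            accepted.append(int(rng.choice(len(residual), p=residual)))
            return accepted
    # All k drafts accepted: take one bonus token from the target's
    # distribution at position k.
    accepted.append(int(rng.choice(target_probs.shape[1], p=target_probs[-1])))
    return accepted

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    V, k = 8, 4  # toy vocabulary size and draft length
    draft_probs = rng.dirichlet(np.ones(V), size=k)
    target_probs = rng.dirichlet(np.ones(V), size=k + 1)
    draft_tokens = np.array([rng.choice(V, p=q) for q in draft_probs])
    print(speculative_step(target_probs, draft_probs, draft_tokens, rng))
```

Relaxed-acceptance variants discussed in the survey trade this exactness guarantee for higher acceptance rates by loosening the min(1, p/q) rule.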
Submission Type: Long submission (more than 12 pages of main content)
Assigned Action Editor: ~Yingce_Xia1
Submission Number: 8485