Slow-Vision, Fast-Language: Training-Free Efficient Inference for dMLLMs

ACL ARR 2026 January Submission 1047 Authors

27 Dec 2025 (modified: 20 Mar 2026) · License: CC BY 4.0
Keywords: Multimodal Diffusion Large Language Models, Efficient Inference, Training-free Methods
Abstract: Diffusion-based Multimodal Large Language Models (dMLLMs) represent a promising frontier in generative AI, yet their practical deployment is severely hindered by the computational burden of iterative denoising over high-resolution visual sequences. In this work, we identify a fundamental asymmetry: visual tokens exhibit high spatial redundancy and tend to aggregate semantically, whereas text tokens drive the dynamic evolution of reasoning. Challenging the coarse "all-or-nothing" approach of existing token pruning methods, we propose Slow-Vision, Fast-Language (SVFL), a training-free acceleration paradigm for dMLLMs. SVFL maintains the complete visual panorama in intermittent "slow" layers, while in the more frequent "fast" layers only the visual details dynamically summoned by text attention participate in efficient interaction. Extensive experiments on the state-of-the-art dMLLM, LLaDA-V, demonstrate that SVFL achieves significant inference acceleration with negligible performance degradation. Furthermore, we verify the framework's universality on the autoregressive LLaVA-1.5, confirming its effectiveness across diverse generative paradigms.
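The slow/fast mechanism described in the abstract can be illustrated with a minimal sketch. This is a hypothetical reconstruction, not the authors' implementation: the function names (`select_visual_tokens`, `layer_schedule`), the mean-pooled attention score, the `keep_ratio`, and the fixed slow-layer interval are all illustrative assumptions about how text attention might summon visual tokens in "fast" layers while "slow" layers retain the full sequence.

```python
import numpy as np

def select_visual_tokens(attn: np.ndarray, keep_ratio: float = 0.25) -> np.ndarray:
    """Pick the visual tokens most attended by text queries (hypothetical sketch).

    attn: (num_text_tokens, num_visual_tokens) text-to-vision attention weights.
    Returns indices of visual tokens retained for a "fast" layer.
    """
    scores = attn.mean(axis=0)                    # aggregate attention each visual token receives
    k = max(1, int(keep_ratio * attn.shape[1]))   # how many visual tokens to keep
    return np.argsort(scores)[::-1][:k]           # indices of the top-k most attended tokens

def layer_schedule(num_layers: int, slow_every: int = 4) -> list[str]:
    """Mark every `slow_every`-th layer as a full-vision "slow" layer (assumed interval)."""
    return ["slow" if i % slow_every == 0 else "fast" for i in range(num_layers)]
```

Under this sketch, a model would run attention over all visual tokens only in the sparse "slow" layers, and restrict the intervening "fast" layers to the top-k tokens selected from the most recent slow layer's text-to-vision attention.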
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: multimodality, vision question answering, video processing
Contribution Types: Model analysis & interpretability, Approaches low compute settings-efficiency, Publicly available software and/or pre-trained models
Languages Studied: English
Submission Number: 1047