MMInference: Accelerating Pre-filling for Long-Context Visual Language Models via Modality-Aware Permutation Sparse Attention
Abstract: The integration of long-context capabilities with visual understanding unlocks unprecedented potential for Vision Language Models (VLMs). However, the quadratic attention complexity during the pre-filling phase remains a significant obstacle to real-world deployment. To overcome this limitation, we introduce MMInference (Multimodality Million tokens Inference), a dynamic sparse attention method that accelerates the pre-filling stage for long-context multi-modal inputs. First, our analysis reveals that the temporal and spatial locality of video input leads to a unique sparse pattern, the Grid pattern. Second, VLMs exhibit markedly different sparse distributions across modalities. We introduce a permutation-based method that leverages the unique Grid pattern and handles modality boundary issues. After an offline search for the optimal sparse pattern of each head, MMInference constructs the sparse distribution dynamically based on the input. We also provide optimized GPU kernels for efficient sparse computation. Notably, MMInference integrates seamlessly into existing VLM pipelines without any model modifications or fine-tuning. Experiments on multi-modal benchmarks, including Video QA, Captioning, VisionNIAH, and Mixed-Modality NIAH, with state-of-the-art long-context VLMs (LongVILA, LLaVA-Video, VideoChat-Flash, Qwen2.5-VL) show that MMInference accelerates the pre-filling stage by up to 8.3x at 1M tokens while maintaining accuracy. Our code is available at https://aka.ms/MMInference.
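To make the core mechanism concrete, below is a minimal, hypothetical PyTorch sketch (not the released implementation or kernels) of permutation-based grid-sparse attention: video tokens are reordered so that the strided Grid pattern becomes contiguous blocks, a cheap pooled estimate selects the most important key/value blocks per query block, and attention is computed only over those blocks before the permutation is undone. All names and parameters here (grid_permutation, phase, keep_ratio, block) are illustrative assumptions.

import torch
import torch.nn.functional as F

def grid_permutation(num_tokens: int, phase: int) -> torch.Tensor:
    # Reorder token indices so tokens sharing the same grid phase (i % phase)
    # become contiguous; a strided "grid" pattern then turns into dense blocks.
    idx = torch.arange(num_tokens)
    return torch.cat([idx[idx % phase == p] for p in range(phase)])

def permuted_block_sparse_attention(q, k, v, phase=4, keep_ratio=0.25, block=64):
    # q, k, v: [num_tokens, head_dim] for a single head. A sketch, not an optimized kernel.
    n, d = q.shape
    assert n % block == 0, "toy sketch assumes the length is a multiple of the block size"
    perm = grid_permutation(n, phase)
    inv = torch.argsort(perm)
    q, k, v = q[perm], k[perm], v[perm]

    nb = n // block
    # Cheap block-level importance estimate from mean-pooled queries/keys.
    qb = q.reshape(nb, block, d).mean(1)
    kb = k.reshape(nb, block, d).mean(1)
    keep = max(1, int(keep_ratio * nb))
    top_blocks = (qb @ kb.T).topk(keep, dim=-1).indices  # [nb, keep]

    out = torch.empty_like(q)
    for i in range(nb):
        cols = torch.cat([torch.arange(j * block, (j + 1) * block)
                          for j in top_blocks[i].tolist()])
        attn = F.softmax(q[i * block:(i + 1) * block] @ k[cols].T / d ** 0.5, dim=-1)
        out[i * block:(i + 1) * block] = attn @ v[cols]
    return out[inv]  # undo the permutation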
Lay Summary: We propose MMInference, a dynamic sparse attention method that accelerates pre-filling for long-context multi-modal inputs. Our analysis identifies a Grid pattern induced by the spatiotemporal locality of video inputs and highlights modality-specific sparsity in VLMs. MMInference uses a permutation-based approach to align with the Grid pattern and resolve modality boundaries, dynamically constructing sparse layouts via an offline head-wise pattern search (a toy sketch of this search follows the lay summary). We also provide optimized GPU kernels for efficient sparse computation.
In tests using state-of-the-art AI systems on video and image tasks, MMInference sped up the input-processing step by over 8× on million-token inputs while still getting the right answers. This helps bring AI models closer to real-world use in areas like long-video understanding, AI tutoring, and multi-modal research tools.
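As referenced above, here is a toy sketch of the offline per-head pattern search, under assumed names and a simplified scoring rule: on a small calibration sample, each candidate sparse mask is scored by the fraction of full attention mass it recovers, and the best-scoring pattern is recorded for each head. The candidate set shown (grid, local, vertical) is illustrative and not the paper's exact inventory.

import torch

def attention_recall(scores: torch.Tensor, mask: torch.Tensor) -> float:
    # Fraction of softmax attention mass covered by a boolean sparsity mask.
    probs = torch.softmax(scores, dim=-1)
    return (probs * mask).sum().item() / probs.sum().item()

def search_head_patterns(scores_per_head, candidate_masks):
    # scores_per_head: list of [n, n] pre-softmax score matrices, one per head,
    # computed once on a calibration input. Returns the best pattern name per head.
    best = []
    for scores in scores_per_head:
        recalls = {name: attention_recall(scores, m) for name, m in candidate_masks.items()}
        best.append(max(recalls, key=recalls.get))
    return best

# Toy candidate masks for a short calibration sequence.
n, phase = 256, 4
idx = torch.arange(n)
candidates = {
    "grid": (idx[:, None] - idx[None, :]) % phase == 0,    # strided grid pattern
    "local": (idx[:, None] - idx[None, :]).abs() < 32,     # sliding local window
    "vertical": (idx[None, :] < 16).expand(n, n),          # a few global key columns
}
heads = [torch.randn(n, n) for _ in range(4)]              # stand-in calibration scores
print(search_head_patterns(heads, candidates))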
Link To Code: https://aka.ms/MMInference
Primary Area: Deep Learning->Large Language Models
Keywords: Visual Language Model, LLMs Inference, Long-Context LLMs, Dynamic Sparse Attention, Efficient Inference
Submission Number: 9774