MAPSparse: Accelerating Pre-filling for Long-Context Visual Language Models via Modality-Aware Permutation Sparse Attention

Published: 06 Mar 2025, Last Modified: 06 Mar 2025 · ICLR 2025 FM-Wild Workshop · CC BY 4.0
Keywords: Visual Language Model, LLMs Inference, Long-Context LLMs, Dynamic Sparse Attention, Efficient Inference
Abstract: The integration of long-context capabilities with visual understanding opens up new possibilities for Vision Language Models (VLMs). However, the quadratic attention complexity of the pre-filling stage remains a major bottleneck, restricting wide deployment in real-world applications. To address this, we propose MAPSparse (Modality-Aware Permutation Sparse Attention), a dynamic sparse attention method that accelerates the pre-filling stage for long-context multi-modal inputs. First, our analysis reveals that the temporal and spatial locality of video input leads to a unique sparse pattern, the Grid pattern. At the same time, VLMs exhibit markedly different sparse distributions across modalities. We introduce a permutation-based method that exploits the Grid pattern and handles the modality-boundary issue. By searching offline for the optimal sparse pattern of each head, MAPSparse constructs the sparse distribution dynamically based on the input. We also provide optimized GPU kernels for efficient sparse computation. Notably, MAPSparse integrates seamlessly into existing VLM pipelines without any model modifications or fine-tuning. Experiments on multi-modal benchmarks, including Video QA, Captioning, Vision-NIAH, and Mix Modality-NIAH, with state-of-the-art long-context VLMs (LongVila and Llava-Video) show that MAPSparse accelerates the pre-filling stage by up to 8.3x at 1M tokens while maintaining competitive performance.
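To make the permutation idea concrete, here is a minimal, illustrative sketch, not the authors' method or kernels: it only shows how a periodic "Grid" attention pattern on video tokens can be permuted into contiguous diagonal blocks that a block-sparse kernel could process densely. The functions `grid_mask`, `grid_permutation`, and `nonzero_blocks`, and the choice of `stride` as the number of tokens per frame, are hypothetical assumptions for illustration; the paper's dynamic per-head pattern search and modality-boundary handling are not modeled here.

```python
import torch

def grid_mask(n: int, stride: int) -> torch.Tensor:
    """Causal boolean mask with a 'Grid' pattern: each query attends to every
    `stride`-th earlier key (e.g. the same spatial position in earlier frames)."""
    q = torch.arange(n).unsqueeze(1)   # query positions, shape (n, 1)
    k = torch.arange(n).unsqueeze(0)   # key positions,   shape (1, n)
    return (k <= q) & ((q - k) % stride == 0)

def grid_permutation(n: int, stride: int) -> torch.Tensor:
    """Reorder token indices by residue mod `stride`, so grid-aligned tokens
    become contiguous and the pattern collapses into dense diagonal blocks."""
    idx = torch.arange(n)
    return torch.argsort(idx % stride)

def nonzero_blocks(mask: torch.Tensor, block: int) -> int:
    """Count block-sparse tiles that contain at least one attended entry."""
    n = mask.shape[0]
    return sum(mask[i:i + block, j:j + block].any().item()
               for i in range(0, n, block) for j in range(0, n, block))

if __name__ == "__main__":
    n, stride, block = 64, 8, 8          # toy sizes; real video contexts are far longer
    mask = grid_mask(n, stride)
    perm = grid_permutation(n, stride)
    permuted = mask[perm][:, perm]       # apply the same permutation to queries and keys
    print("tiles touched before permutation:", nonzero_blocks(mask, block))      # 36
    print("tiles touched after  permutation:", nonzero_blocks(permuted, block))  # 8
```

In this toy setting the same nonzero entries are covered by far fewer tiles after permutation, which is the kind of gain a block-sparse GPU kernel can exploit; mixed-modality inputs would additionally require restricting such a permutation to the video segment so that text tokens keep their original layout.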
Submission Number: 39