Keywords: Vision-Language Models, Parallel Encoding, Large Multimodal Models, Latency-Constrained Inference, Long Context, Efficient Inference, Local and Global Attention
TL;DR: PEVLM is a fast, fine-tuning-free method for long video understanding in VLMs, achieving up to a 7.47× attention speedup and higher accuracy than prior parallel encoding approaches by combining parallel encoding with sequential position embeddings.
Abstract: Vision-Language Models (VLMs) have demonstrated strong capabilities in multimodal understanding and generation tasks. However, their application to long video understanding remains hindered by the quadratic complexity of standard attention mechanisms. In this work, we introduce \textbf{PEVLM}, a fine-tuning-free parallel encoding method designed to enhance the prefilling efficiency of VLMs in long video scenarios. To the best of our knowledge, this is the first work to adapt parallel encoding to VLMs. PEVLM partitions the input video into context blocks with a shared sink block, while preserving sequential position embeddings to align the attention score distribution with that of Full-Attention. This design reduces the complexity of attention from $O((T \times N)^2)$ to $O(T \times N)$, where $T$ is the number of frames and $N$ is the number of tokens per frame, with minimal loss in accuracy.
Extensive experiments across multiple state-of-the-art models and benchmarks demonstrate that PEVLM consistently outperforms existing parallel encoding approaches, achieving up to a \textbf{7.47$\times$} speedup in attention computation and reducing end-to-end latency by \textbf{44\%} to \textbf{50\%}. Remarkably, PEVLM not only maintains high accuracy but in some settings even surpasses Full-Attention performance. Under strict latency constraints, it achieves substantial gains, improving accuracy from \textbf{23.26\%} to \textbf{61.03\%}. These results underscore the effectiveness of PEVLM for low-latency, long-context video understanding, making it a promising solution for real-world applications.
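The following is a minimal single-head NumPy sketch of the block-parallel attention pattern the abstract describes: each context block attends only to a shared sink block and to itself, while rotary position embeddings are applied at the tokens' original sequential positions rather than being reset per block. The `rope` and `parallel_encode` helpers, the toy sizes, and the omission of causal masking are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def rope(x, positions, base=10000.0):
    """Toy rotary position embedding applied at absolute positions (illustrative only)."""
    half = x.shape[-1] // 2
    freqs = base ** (-np.arange(half) / half)             # (half,)
    angles = positions[:, None] * freqs[None, :]          # (seq, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., :half], x[..., half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

def softmax(s):
    s = s - s.max(axis=-1, keepdims=True)
    e = np.exp(s)
    return e / e.sum(axis=-1, keepdims=True)

def parallel_encode(q, k, v, sink_len, block_len):
    """Block-parallel prefill sketch: every context block attends only to the
    shared sink block and to itself, but position ids stay sequential (global),
    keeping the attention score distribution close to Full-Attention's."""
    seq, d = q.shape
    pos = np.arange(seq, dtype=np.float64)
    qr, kr = rope(q, pos), rope(k, pos)                   # sequential position ids
    out = np.empty_like(v)
    # The sink block attends to itself.
    out[:sink_len] = softmax(qr[:sink_len] @ kr[:sink_len].T / np.sqrt(d)) @ v[:sink_len]
    # Context blocks are independent, so this loop can run in parallel; each block
    # touches only sink_len + block_len keys, so cost grows linearly with the
    # number of blocks. (Causal masking is omitted to keep the sketch short.)
    for start in range(sink_len, seq, block_len):
        end = min(start + block_len, seq)
        keys = np.concatenate([kr[:sink_len], kr[start:end]])
        vals = np.concatenate([v[:sink_len], v[start:end]])
        out[start:end] = softmax(qr[start:end] @ keys.T / np.sqrt(d)) @ vals
    return out

# Usage on random toy data: 64 tokens, head dim 16, an 8-token sink, 14-token blocks.
rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((64, 16)) for _ in range(3))
print(parallel_encode(q, k, v, sink_len=8, block_len=14).shape)  # (64, 16)
```

Because each block's attention only spans the sink block plus its own tokens, the total work scales with the number of blocks rather than quadratically with the full sequence length, which is the source of the prefilling speedup claimed above.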
Primary Area: generative models
Submission Number: 8905