LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding

Published: 01 May 2025, Last Modified: 18 Jun 2025. ICML 2025 poster. License: CC BY 4.0
TL;DR: Large language model for long video language understanding
Abstract: Multimodal Large Language Models (MLLMs) have shown promising progress in understanding and analyzing video content. However, processing long videos remains a significant challenge constrained by the LLM's context size. To address this limitation, we propose \textbf{LongVU}, a spatiotemporal adaptive compression mechanism that reduces the number of video tokens while preserving the visual details of long videos. Our idea is based on leveraging cross-modal queries and inter-frame dependencies to adaptively reduce temporal and spatial redundancy in videos. Specifically, we leverage DINOv2 features to remove redundant frames that exhibit high similarity. Then we utilize a text-guided cross-modal query for selective frame feature reduction. Further, we perform spatial token reduction across frames based on their temporal dependencies. Our adaptive compression strategy effectively processes a large number of frames with little visual information loss within a given context length. LongVU consistently surpasses existing methods across a variety of video understanding benchmarks, especially on hour-long video understanding tasks such as VideoMME and MLVU. Given a lightweight LLM, LongVU also scales effectively to a smaller model size with state-of-the-art video understanding performance.
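To make the temporal-redundancy step concrete, here is a minimal, hypothetical sketch of similarity-based frame pruning: it assumes pooled per-frame feature vectors (e.g., from DINOv2) and a cosine-similarity threshold, and drops frames that are too similar to the last retained frame. The threshold value, pooling, and function names are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def prune_redundant_frames(frame_feats: torch.Tensor, sim_threshold: float = 0.9):
    """Keep a frame only if it is sufficiently different from the last kept frame.

    frame_feats: (T, D) pooled per-frame features (e.g., DINOv2 embeddings).
    Returns indices of retained frames; the rest are treated as redundant.
    """
    kept = [0]  # always keep the first frame
    for t in range(1, frame_feats.size(0)):
        # Cosine similarity between this frame and the most recently kept frame.
        sim = F.cosine_similarity(frame_feats[t], frame_feats[kept[-1]], dim=0)
        if sim < sim_threshold:  # different enough -> keep it
            kept.append(t)
    return kept

# Toy usage: random features stand in for real DINOv2 embeddings.
if __name__ == "__main__":
    feats = torch.randn(64, 768)  # 64 frames, 768-dim features
    print(prune_redundant_frames(feats, sim_threshold=0.9))
```

The retained frames would then be passed to the query-guided and spatial reduction stages described in the abstract.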
Lay Summary: AI models—especially the new generation of multimodal large language models (MLLMs) that can understand both images and text—often stumble when dealing with long videos. It's like their "working memory" isn't large enough to hold all the visual information from an hour-long video. We wondered: how can we help these models grasp long video content without overwhelming them? To tackle this, we introduced LongVU, a new approach that adaptively compresses videos in both time and space. It identifies and reduces repetitive or less crucial visual information. If you ask a question about the video, our method prioritizes the details most relevant to your query and simplifies the rest, making it easier for the model to focus and respond accurately. Our approach allows these AI models to process a significantly larger number of frames from long videos with very little loss of important visual detail. It paves the way for future research in video compression tailored for MLLM-based applications, enabling more effective understanding of long videos, media, and streaming video.
Link To Code: https://vision-cair.github.io/LongVU
Primary Area: Deep Learning->Foundation Models
Keywords: Long Video Understanding;Video-Language;Spatiotemporal
Submission Number: 9623