Compress-Then-Enhance: Spatiotemporal Graph-Guided Visual Tokenization for Multimodal Video LLMs

Shilv Cai

Published: 31 Aug 2025, Last Modified: 23 Sept 2025OpenReview Archive Direct UploadEveryoneCC BY-NC 4.0

Abstract: Multimodal Large Language Models (MLLMs) for video understanding are bottlenecked by redundant and weakly informative visual tokens, especially in long-form videos where naïve uniform sampling or global pooling erodes fine-grained spatiotemporal cues. We propose a compress-then-enhance pipeline that integrates Visual Token Compression, Spatiotemporal Graph reasoning, and Spatial Pooling to deliver compact yet information-rich representations aligned to a language interface. First, we perform token-efficient ingestion via adaptive Spatial Pooling and motion-aware pruning to discard low-saliency regions while preserving dynamic entities and interactions. We then construct a Spatiotemporal Graph whose nodes represent region-level tokens across frames and whose edges capture appearance affinity, motion continuity, and cross-attention priors. Message passing over this graph yields Visual Token Enhancement: tokens are reweighted and enriched with long-range temporal context and cross-object relations before being projected into the MLLM’s visual vocabulary. Finally, a token budget-aware router selects a minimal set of enhanced tokens per query, enabling scalable conditioning for instruction-tuned video LLMs. Across video QA, dense captioning, and long-horizon reasoning tasks, our method reduces visual tokens by 30–60% and improves end-to-end throughput, while maintaining or surpassing accuracy relative to strong baselines. Ablations confirm that (i) motion-guided compression preserves action semantics, (ii) graph-based enhancement recovers dependencies lost by compression, and (iii) query-conditioned routing further stabilizes long-context performance. This work demonstrates that principled Visual Token Compression combined with graph-driven Visual Token Enhancement provides an effective interface between high-bandwidth video streams and token-limited MLLMs, enabling efficient, fine-grained video understanding without sacrificing reasoning fidelity.