Keywords: Video-LMM, Grounding, Pointing
TL;DR: We present VideoMolmo, a Video-LMM for spatio-temporal localization through pointing. With a temporal attention module, bidirectional mask fusion, and a new 72k-video pointing dataset plus the VPoS-Bench benchmark, it achieves superior accuracy and more transparent reasoning than prior methods.
Abstract: Spatio-temporal localization, the ability to identify both the position and temporal evolution of objects, is essential for applications from cell tracking to autonomous navigation. Recent Video Large Multimodal Models (Video-LMMs) show promise but remain limited by coarse predictions, heavy reliance on dense mask optimization, and poor interpretability. We introduce VideoMolmo, a two-stage framework that grounds objects through point-based localization. Rather than directly predicting dense masks, VideoMolmo first produces precise points as lightweight, interpretable anchors, which are then used for downstream tasks including referring segmentation, video object segmentation, and counting. By decoupling localization from task execution, our approach provides more robust and transparent reasoning. Built on Molmo, our framework incorporates a temporal attention module for cross-frame reasoning and introduces a novel bidirectional temporal mask fusion strategy, enabling coherent point propagation and accurate segmentation. To facilitate training and evaluation, we release a large-scale spatio-temporal pointing dataset of 72k video–caption pairs with 100k annotated points and curate VPoS-Bench, a challenging benchmark spanning five real-world domains. Experiments show that VideoMolmo outperforms existing approaches, with gains of $5.4$ percentage points (pp) on VPoS-Bench and $9.5$ pp on MeViS. These results highlight the effectiveness of point-based representations as a foundation for interpretable, fine-grained reasoning in dynamic visual environments.
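To make the two-stage design concrete, below is a minimal Python sketch of the decoupled pipeline: per-frame point prediction, point-prompted mask generation, and a bidirectional pass over the clip whose masks are fused. All function bodies are placeholder stand-ins, not the authors' implementation; in particular, the center-point predictor, the disk-shaped segmenter, and the per-pixel union used as the fusion rule are illustrative assumptions in place of VideoMolmo's Molmo-based pointer, promptable segmenter, and bidirectional temporal mask fusion.

```python
import numpy as np

def predict_points(frames, query):
    """Stage 1 (placeholder): return one (x, y) anchor point per frame.
    A real system would query a pointing Video-LMM with `query`; here we
    simply return the frame center so the sketch runs end to end."""
    return [(f.shape[1] / 2, f.shape[0] / 2) for f in frames]

def segment_from_point(frame, point, radius=20):
    """Stage 2 (placeholder): turn a point prompt into a binary mask.
    A real system would call a promptable segmenter; here we draw a disk."""
    h, w = frame.shape[:2]
    yy, xx = np.mgrid[0:h, 0:w]
    return (xx - point[0]) ** 2 + (yy - point[1]) ** 2 <= radius ** 2

def bidirectional_fuse(forward_masks, backward_masks):
    """Fuse masks from forward and backward passes over the clip.
    A per-pixel union stands in for the paper's fusion strategy."""
    return [np.logical_or(f, b) for f, b in zip(forward_masks, backward_masks)]

def localize_and_segment(frames, query):
    points = predict_points(frames, query)  # lightweight, interpretable anchors
    fwd = [segment_from_point(f, p) for f, p in zip(frames, points)]
    bwd = [segment_from_point(f, p)
           for f, p in zip(reversed(frames), reversed(points))][::-1]
    return bidirectional_fuse(fwd, bwd)

# Usage on a dummy 4-frame clip
clip = [np.zeros((240, 320, 3), dtype=np.uint8) for _ in range(4)]
masks = localize_and_segment(clip, "the red ball")
print([int(m.sum()) for m in masks])  # mask area per frame
```

The point of the sketch is the decoupling itself: localization (points) can be inspected and evaluated independently of the downstream mask generation, which is what the abstract refers to as more robust and transparent reasoning.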
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 16980