VideoMolmo: Spatio-Temporal Grounding Meets Pointing

ACL ARR 2026 January Submission8328 Authors

06 Jan 2026 (modified: 20 Mar 2026) | ACL ARR 2026 January Submission | Readers: Everyone | License: CC BY 4.0
Keywords: Video-LMM, Grounding, Pointing
Abstract: Spatio-temporal localization, identifying the position and temporal evolution of objects, is essential for applications from cell tracking to autonomous navigation. Video Large Multimodal Models show promise but remain constrained by coarse predictions, reliance on dense mask optimization, and limited interpretability. We introduce VideoMolmo, a two-stage framework that grounds objects through point-based localization. Rather than predicting dense masks, VideoMolmo produces precise points as lightweight, interpretable anchors, then uses them for downstream tasks including referring segmentation, video object segmentation, and counting. By decoupling localization from task execution, we provide robust and transparent reasoning. Built on Molmo, VideoMolmo incorporates a novel temporal attention module for cross-frame reasoning and a bidirectional temporal mask fusion strategy, enabling coherent point propagation and accurate segmentation. For training and evaluation, we release a large-scale spatio-temporal pointing dataset of 72k video-caption pairs with 100k annotated points and curate VPoS-Bench, a challenging benchmark spanning five real-world domains. Experiments show that VideoMolmo outperforms existing approaches on the MeViS benchmark and achieves a gain of 5.4 percentage points on VPoS-Bench. These results highlight the effectiveness of point-based representations as a foundation for interpretable, fine-grained reasoning in dynamic visual environments.
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: multimodality, video processing, spoken language grounding, cross-modal information extraction, cross-modal application
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Publicly available software and/or pre-trained models, Data resources, Data analysis
Languages Studied: English
Submission Number: 8328