STaR: Multi-Granular Spatio-Temporal Reasoning for Long-Form Dense Video Captioning

Yihao Wu, Chenhuan Cai, Liqi Yan, Huapeng Li, Jianhui Zhang, Jiahao Liu, Qifan Wang, Fangli Guan, Pan Li

Published: 2025, Last Modified: 09 May 2026, Venue: ECAI 2025, License: CC BY-SA 4.0
Abstract: Dense video captioning is crucial for video understanding in everyday applications and remains a significant challenge in multimodal analysis. Existing methods often overlook video-to-dynamic-space mapping at varying scales, producing captions that are overly general and fail to capture real-world physical detail. To address this limitation, we propose a multi-granularity Spatio-Temporal Reasoning (STaR) approach, which integrates: (i) efficient global feature integration to model long-term temporal dependencies, (ii) spatial attention mechanisms with position encoding to capture absolute spatial information, and (iii) cross-modal feature fusion to align and unify global, local, and spatial representations. Moreover, we enhance the framework with a Large Language Model (LLM) to improve the richness and naturalness of the generated descriptions. We evaluate the proposed method through comparative experiments on the SoccerNet dataset. Experimental results demonstrate that our model improves localization accuracy and generates captions with superior temporal and spatial detail fidelity. The code is available at https://github.com/bread-555/STaR.
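As a rough illustration of how components (ii) and (iii) might fit together, the sketch below adds a sinusoidal position encoding to spatial region features and then lets a global video feature attend over the combined local and spatial tokens. This is not the authors' implementation: the function names (`position_encoding`, `attend`, `fuse`), the use of scaled dot-product attention, the 1D region index, and the toy 4-dimensional features are all assumptions made for illustration only.

```python
import math

def position_encoding(index, dim):
    """Sinusoidal encoding of one absolute position index (dim must be even)."""
    enc = []
    for i in range(0, dim, 2):
        angle = index / (10000 ** (i / dim))
        enc.extend([math.sin(angle), math.cos(angle)])
    return enc

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, tokens):
    """Scaled dot-product attention of one query over a list of token vectors."""
    scale = math.sqrt(len(query))
    scores = [sum(q * t for q, t in zip(query, tok)) / scale for tok in tokens]
    weights = softmax(scores)
    fused = [sum(w * tok[d] for w, tok in zip(weights, tokens))
             for d in range(len(query))]
    return fused, weights

def fuse(global_feat, local_feats, spatial_feats):
    """Hypothetical fusion step: position-encode the spatial tokens, then let
    the global feature attend over local and spatial tokens together."""
    dim = len(global_feat)
    spatial = [
        [x + p for x, p in zip(feat, position_encoding(i, dim))]
        for i, feat in enumerate(spatial_feats)
    ]
    return attend(global_feat, local_feats + spatial)

# Toy inputs (made up): one global vector, two local clips, two spatial regions.
global_feat = [1.0, 0.0, 0.0, 1.0]
local_feats = [[0.5, 0.1, 0.0, 0.2], [0.0, 0.3, 0.9, 0.1]]
spatial_feats = [[0.2, 0.2, 0.2, 0.2], [0.7, 0.0, 0.1, 0.4]]
fused, weights = fuse(global_feat, local_feats, spatial_feats)
```

Here `fused` is a single vector with the same dimension as the global feature, and `weights` holds one attention weight per local or spatial token. A real model would use learned projections and batched tensors; the released code at the URL above is the reference for the actual design.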