DSSM-KG: Dual-Stream State-Space Modeling with Adaptive Knowledge Injection for Video Captioning

Published: 2025 · Last Modified: 12 Nov 2025 · ICMR 2025 · CC BY-SA 4.0
Abstract: Video captioning aims to generate natural language descriptions of video content. Recent methods extract temporal and spatial information separately and use dataset-specific prior knowledge to enhance caption quality. However, they are often inadequate at joint spatiotemporal modeling and fail to exploit commonsense knowledge, making it difficult to fully understand the video. To address these issues, this paper proposes a dual-stream state-space model with knowledge-graph injection (DSSM-KG) based on cross-modal knowledge injection. Specifically, by integrating the heterogeneous Mamba with the Transformer in both parallel and sequential manners, we construct a spatially enhanced dual-stream state-space module (S-DSSM) and a temporally enhanced dual-stream state-space module (T-DSSM) to strengthen joint spatiotemporal modeling. Additionally, a knowledge graph that integrates both commonsense and dataset-specific information is constructed and adaptively injected into the decoder to furnish the model with extensive video-related knowledge. Experimental results indicate that the structural designs of DSSM-KG, together with the knowledge injection mechanism, are highly effective, yielding competitive performance on mainstream video captioning datasets such as MSVD and MSR-VTT.
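The parallel and sequential integration of a state-space stream with a Transformer stream, as described above, can be illustrated with a minimal toy sketch. This is not the paper's actual S-DSSM/T-DSSM architecture: the block internals (a simple decayed scan standing in for Mamba, plain self-attention standing in for the Transformer) and the fusion rules (summation for parallel, composition for sequential) are illustrative assumptions, shown only to make the two combination patterns concrete.

```python
import numpy as np

def ssm_block(x, decay=0.9):
    """Toy state-space scan h_t = decay * h_{t-1} + x_t (stand-in for a Mamba block)."""
    h = np.zeros_like(x[0])
    out = []
    for xt in x:                      # x: (T, d) sequence of frame features
        h = decay * h + xt            # linear recurrence over time
        out.append(h.copy())
    return np.stack(out)

def attn_block(x):
    """Toy self-attention softmax(x x^T / sqrt(d)) x (stand-in for a Transformer block)."""
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)   # stabilized softmax
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)
    return w @ x

def parallel_fusion(x):
    """Parallel manner: both streams see the same input; outputs are summed."""
    return ssm_block(x) + attn_block(x)

def sequential_fusion(x):
    """Sequential manner: the attention stream refines the state-space output."""
    return attn_block(ssm_block(x))

x = np.random.default_rng(0).standard_normal((5, 4))  # 5 frames, 4-dim features
y_par, y_seq = parallel_fusion(x), sequential_fusion(x)
```

Both fusions preserve the sequence shape, so the two module variants are interchangeable building blocks in a larger encoder; the actual model presumably uses learned projections and gating rather than these fixed operations.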