Online Dense Video Captioning with Factorized Action Object Retrieval

TMLR Paper7692 Authors

26 Feb 2026 (modified: 16 Mar 2026) | Under review for TMLR | CC BY 4.0
Abstract: Dense video captioning presents the dual challenge of temporally localizing events and generating descriptive captions within long videos. However, existing methods often struggle to handle evolving contexts in streaming settings or depend on static, global retrieval mechanisms. To address these limitations, we introduce a novel framework that embeds a dynamic, factorized retrieval mechanism directly into a causally aware video processing backbone. Unlike approaches utilizing static global retrieval, our method dynamically retrieves concise action and object phrases at each timestep as the video streams. These retrieved phrases are integrated into a causal, autoregressive transformer, enriching the video representation to enhance the text decoder. Furthermore, to mitigate the scarcity of densely annotated video data, we introduce an image-based simulated video pretraining strategy. Experiments on the ViTT, YouCook2, and ActivityNet benchmarks demonstrate that our model significantly outperforms existing global and online methods.
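As a rough illustration of the per-timestep, factorized retrieval the abstract describes, the sketch below queries separate action and object phrase banks at each step, using only the frames seen so far. It is not the authors' implementation: the running-mean context (standing in for the causal transformer), the cosine-similarity scoring, the toy 2-D embeddings, and every function and variable name are assumptions made for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def top_k(query, bank, k):
    """Return the k phrases in `bank` whose embeddings best match `query`."""
    ranked = sorted(bank, key=lambda p: cosine(query, bank[p]), reverse=True)
    return ranked[:k]

def stream_retrieve(frames, action_bank, object_bank, k=1):
    """Yield (t, actions, objects) per timestep, using only frames[0..t]."""
    running = [0.0] * len(frames[0])
    for t, frame in enumerate(frames):
        # Assumed causal context: running mean of the features seen so far.
        running = [(r * t + f) / (t + 1) for r, f in zip(running, frame)]
        # Factorized retrieval: actions and objects come from separate banks.
        yield t, top_k(running, action_bank, k), top_k(running, object_bank, k)

# Hypothetical toy phrase banks and frame features (2-D for readability).
action_bank = {"chop": [1.0, 0.0], "stir": [0.0, 1.0]}
object_bank = {"onion": [0.9, 0.1], "pot": [0.1, 0.9]}
frames = [[1.0, 0.0], [0.2, 1.0], [0.0, 1.0]]

results = list(stream_retrieve(frames, action_bank, object_bank))
for t, actions, objects in results:
    print(t, actions, objects)
```

In this toy stream the retrieved phrases shift from "chop"/"onion" to "stir"/"pot" as the accumulated context changes. This contrasts with a static global retrieval, which would query once with features from the entire video.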
Submission Type: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Wei_Liu3
Submission Number: 7692