DVC-SGRL: Adapting MLLMs for Temporally Precise Dense Video Captioning via Semantically Guided Reinforcement Learning
Keywords: Dense video captioning, Multimodal large language model, Reinforcement learning
Abstract: Dense Video Captioning (DVC) aims to localize and describe multiple events within untrimmed videos. While methods based on Multimodal Large Language Models (MLLMs) show promise, their ability to localize event boundaries precisely remains a significant limitation. This weakness stems from a reliance on supervised fine-tuning with cross-entropy loss, which frames timestamp prediction as a classification task. In this formulation, the model learns only to match timestamps exactly, with no awareness of how close a prediction is to the ground truth. This limits its ability to interpret time as a continuous signal and hinders accurate event localization. To address this, we introduce DVC-SGRL, a reinforcement learning framework that provides semantically guided temporal supervision, enabling general-purpose MLLMs to be adapted for dense video captioning. Our approach leverages the model's strong captioning ability to improve its weaker temporal localization through a novel matching mechanism and a corresponding reward mechanism. Our semantically guided reward function uses strong matches in caption content to create robust learning signals for refining event boundaries. This "soft alignment" approach, which decouples the evaluation of content and timing, offers far more informative supervision than standard classification losses. Experimental results demonstrate that DVC-SGRL achieves significant improvements in both localization and captioning performance, reaching state-of-the-art results on YouCook2 and ActivityNet Captions.
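The contrast the abstract draws between exact-match supervision and a proximity-aware reward can be illustrated with a minimal sketch. This is not the DVC-SGRL reward itself, which the abstract does not specify. It assumes temporal IoU as the timing score and token-level Jaccard overlap as a stand-in for caption matching; the function names and the `sim_threshold` parameter are hypothetical.

```python
def temporal_iou(a, b):
    """Intersection over union of two (start, end) segments, in seconds."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def caption_similarity(x, y):
    """Token-level Jaccard overlap (assumed stand-in for a semantic matcher)."""
    tx, ty = set(x.lower().split()), set(y.lower().split())
    if not tx or not ty:
        return 0.0
    return len(tx & ty) / len(tx | ty)


def semantically_guided_reward(preds, refs, sim_threshold=0.5):
    """Mean temporal IoU over reference events, matched by caption content.

    preds and refs are lists of (start, end, caption) tuples. Each reference
    is paired with the prediction whose caption is most similar, provided the
    similarity reaches sim_threshold (a hypothetical hyperparameter).
    Unmatched references contribute zero.
    """
    if not refs:
        return 0.0
    total = 0.0
    for r_start, r_end, r_cap in refs:
        best, best_sim = None, sim_threshold
        for pred in preds:
            sim = caption_similarity(pred[2], r_cap)
            if sim >= best_sim:
                best, best_sim = pred, sim
        if best is not None:
            # Content decided the pairing; timing alone decides the score.
            total += temporal_iou((best[0], best[1]), (r_start, r_end))
    return total / len(refs)


refs = [(10.0, 20.0, "add the chopped onions")]
near = [(11.0, 20.0, "add the chopped onions")]   # boundary off by one second
far = [(30.0, 40.0, "add the chopped onions")]    # no temporal overlap
print(semantically_guided_reward(near, refs))
print(semantically_guided_reward(far, refs))
```

Under an exact-match objective, both `near` and `far` count as equally wrong because neither reproduces the reference timestamps. The sketched reward separates them, since the caption match fixes which events are compared and the score then depends only on how closely the boundaries agree.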
Supplementary Material: zip
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 12474