Abstract: Video captioning remains a challenging task in computer vision, with existing approaches suffering from temporal inconsistencies and incomplete coverage of objects. In this work, we address these challenges by introducing ReCap, a novel video captioning framework aimed at minimizing inconsistency and maximizing completeness with respect to Significant Objects—those persisting prominently in the video beyond a defined temporal threshold. ReCap operates in two stages: In the first stage, frame-level captions are generated using the vision-language model mPLUG with a recursive decoding strategy that explores multiple token branches. Each token is then rectified using an external object rectifier (YOLOv8), ensuring that only visually grounded objects are retained. In the second stage, the rectified captions are passed to a Large Language Model (LLM) with an explicit prompt to produce a coherent, comprehensive video-level caption summarizing the frame-level captions. To evaluate semantic accuracy and temporal coverage, we introduce two metrics: (a) Temporal Inconsistency and (b) Temporal Completeness. We compare performance on the two datasets used by the current state-of-the-art video captioning model, mPLUG-2: MSRVTT and MSVD. On MSRVTT, ReCap achieves a 40.3% reduction in inconsistency and a 28.9% improvement in completeness over mPLUG-2 at a 0.3 significance threshold. On MSVD, it yields a 33% lower inconsistency and a 20.8% gain in completeness, averaged across thresholds. Our results demonstrate that object-grounded rectification combined with LLM-based summarization yields more accurate and exhaustive video captions, making ReCap a promising solution for comprehensive video understanding.
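The two-stage pipeline described above can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: the model calls (mPLUG captioning, YOLOv8 detection, LLM summarization) are replaced by stubs, and all function names, the toy object vocabulary, and the rectification rule are assumptions for illustration only.

```python
def generate_frame_caption(frame):
    # Stage 1a (stub): frame-level caption from a vision-language
    # model such as mPLUG, with recursive multi-branch decoding.
    return "a dog runs in a park"

def detect_objects(frame):
    # Stage 1b (stub): external object detector (e.g. YOLOv8)
    # used as the rectifier's source of visual grounding.
    return {"dog", "park"}

def rectify_caption(caption, detected):
    # Drop object tokens not confirmed by the detector;
    # non-object words pass through unchanged.
    object_vocab = {"dog", "cat", "park", "ball"}  # toy vocabulary (assumption)
    kept = [tok for tok in caption.split()
            if tok not in object_vocab or tok in detected]
    return " ".join(kept)

def summarize_captions(captions):
    # Stage 2 (stub): an LLM would merge the rectified frame-level
    # captions into one coherent video-level caption; here we just
    # deduplicate while preserving order.
    return " / ".join(dict.fromkeys(captions))

def recap(frames):
    rectified = [rectify_caption(generate_frame_caption(f), detect_objects(f))
                 for f in frames]
    return summarize_captions(rectified)
```

A caller would invoke `recap(frames)` on a list of decoded video frames; with the stubs above, two identical frames collapse to a single grounded caption.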
External IDs: dblp:conf/acpr/AdhikaryHC25