Exploring Object-Centered External Knowledge for Fine-Grained Video Paragraph Captioning

Published: 01 Jan 2024 · Last Modified: 11 Apr 2025 · ICASSP 2024 · CC BY-SA 4.0
Abstract: The video paragraph captioning task aims to generate a detailed, fluent, and relevant paragraph for a given video. Prior studies often focus on isolating visual objects (the potential main components of a sentence) from the overall video content; they rarely explore the latent semantic relations between objects and high-level video concepts, resulting in dull or even incorrect descriptions. To create fine-grained and contextually relevant paragraph captions, we propose a novel framework that constructs a concept graph from a commonsense knowledge base and infers richer semantic meaning from the visual objects. Moreover, we employ a Vision-Guided Concept Selection Network that incorporates an under-sentence supervision mechanism to align the external knowledge with the visual information. Through extensive experiments on ActivityNet Captions and YouCook2, we demonstrate the effectiveness of our method compared with state-of-the-art approaches.
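To illustrate the concept-graph construction step described in the abstract, below is a minimal sketch that links object labels detected in a video to higher-level concepts via commonsense triples. The abstract does not name the knowledge base or its retrieval interface, so the triples, labels, and function names here are illustrative assumptions (ConceptNet-style relations are used for concreteness), not the authors' implementation.

```python
import networkx as nx

# Hypothetical commonsense triples (subject, relation, concept).
# In practice these would be retrieved from a knowledge base such as
# ConceptNet; this hardcoded list is only for illustration.
TRIPLES = [
    ("knife", "UsedFor", "cutting"),
    ("knife", "AtLocation", "kitchen"),
    ("carrot", "IsA", "vegetable"),
    ("carrot", "ReceivesAction", "chopped"),
    ("pan", "UsedFor", "frying"),
]

def build_concept_graph(detected_objects):
    """Build a directed graph connecting detected video objects to
    higher-level commonsense concepts via typed relation edges."""
    g = nx.DiGraph()
    for subj, rel, concept in TRIPLES:
        if subj in detected_objects:
            g.add_edge(subj, concept, relation=rel)
    return g

# Example: objects detected in a cooking video clip.
graph = build_concept_graph({"knife", "carrot"})
for u, v, d in graph.edges(data=True):
    print(f"{u} --{d['relation']}--> {v}")
```

A downstream concept selection module, such as the Vision-Guided Concept Selection Network named above, would then score these candidate concepts against the visual features to decide which ones inform the generated paragraph.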