Fine-grained video paragraph captioning via exploring object-centered internal and external knowledge

Anonymous

16 Nov 2021 (modified: 05 May 2023) · ACL ARR 2021 November Blind Submission · Readers: Everyone
Abstract: The video paragraph captioning task aims to generate a fine-grained, coherent, and relevant paragraph for a video. Existing works often treat objects (the potential main components of a sentence) in isolation from the whole video content and rarely explore the latent semantic relations between an object and the current video concepts, causing the generated descriptions to be dull or even incorrect. Moreover, unlike in images, where objects are static, the temporal states of objects change throughout a video, and this dynamic information can contribute to a better understanding of the whole video content. To generate a more detailed and on-topic paragraph, we propose a novel framework that explores the rich semantic and temporal meaning of objects by constructing a concept graph from external commonsense knowledge and a state graph from the internal video frames. Extensive experiments on ActivityNet Captions and YouCook2 demonstrate the effectiveness of our method compared to state-of-the-art works. We will release our code on GitHub.
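The abstract only sketches the framework, so the following toy Python snippet is a hedged illustration of the two graph types it mentions, not the authors' implementation: the object labels, the commonsense relation table (standing in for an external knowledge base), and the networkx representation are all assumptions made for illustration.

# Hypothetical sketch of the two graphs named in the abstract: an
# object-centered concept graph built from external commonsense triples,
# and a state graph linking the same object across consecutive frames.
import networkx as nx

# Assumed inputs: per-frame detected object labels and a toy commonsense
# relation table (a stand-in for an external knowledge base).
frame_objects = {
    0: ["person", "knife", "onion"],
    1: ["person", "knife", "onion"],
    2: ["person", "pan", "onion"],
}
commonsense = [
    ("knife", "UsedFor", "cutting"),
    ("onion", "ReceivesAction", "chopped"),
    ("pan", "UsedFor", "frying"),
]

# Concept graph: detected objects plus semantically related external concepts.
concept_graph = nx.Graph()
objects = {obj for objs in frame_objects.values() for obj in objs}
for head, relation, tail in commonsense:
    if head in objects:
        concept_graph.add_edge(head, tail, relation=relation)

# State graph: one node per (object, frame); edges connect the same object
# in consecutive frames, capturing its changing temporal state.
state_graph = nx.DiGraph()
frames = sorted(frame_objects)
for t_prev, t_next in zip(frames, frames[1:]):
    for obj in set(frame_objects[t_prev]) & set(frame_objects[t_next]):
        state_graph.add_edge((obj, t_prev), (obj, t_next), relation="next_state")

print(concept_graph.edges(data=True))
print(state_graph.edges(data=True))

In a full system, node features from both graphs would presumably be fused with frame features before decoding the paragraph; this sketch only shows how the graph structures could be assembled.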