GEM-VPC: A dual Graph-Enhanced Multimodal integration for Video Paragraph Captioning

ACL ARR 2024 June Submission 3696 Authors

16 Jun 2024 (modified: 03 Jul 2024) · ACL ARR 2024 June Submission · CC BY 4.0
Abstract: Video Paragraph Captioning (VPC) aims to generate paragraph captions that summarise the key events within a video. Despite recent advancements, challenges persist, notably in effectively utilising the multimodal signals inherent in videos and in addressing the long-tail distribution of words. This paper introduces a novel multimodal integrated caption generation framework for VPC that leverages information from various modalities and external knowledge bases. Our framework constructs two graphs: a video-specific temporal graph capturing major events and interactions between multimodal information and commonsense knowledge, and a theme graph representing correlations between words of a specific theme. These graphs serve as input to a transformer network with a shared encoder-decoder architecture. We also introduce a node selection module that improves decoding efficiency by selecting the most relevant nodes from the graphs. Our results demonstrate superior performance across benchmark datasets.
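The abstract does not detail how the node selection module picks graph nodes before decoding. The sketch below is one plausible reading under assumed design choices: scoring node embeddings against a query vector (e.g. a pooled video representation) and keeping the top-k highest-scoring nodes. All names (`NodeSelection`, `top_k`, the projection layers) are hypothetical and not taken from the paper.

```python
import torch
import torch.nn as nn


class NodeSelection(nn.Module):
    """Illustrative sketch: score graph-node embeddings against a query
    and keep only the top-k nodes as input to the decoder."""

    def __init__(self, dim: int, top_k: int = 32):
        super().__init__()
        self.query_proj = nn.Linear(dim, dim)  # hypothetical projection
        self.node_proj = nn.Linear(dim, dim)
        self.top_k = top_k

    def forward(self, node_emb: torch.Tensor, query: torch.Tensor):
        # node_emb: (batch, num_nodes, dim); query: (batch, dim)
        q = self.query_proj(query).unsqueeze(1)        # (batch, 1, dim)
        k = self.node_proj(node_emb)                   # (batch, num_nodes, dim)
        scores = (q * k).sum(-1) / k.size(-1) ** 0.5   # scaled dot-product relevance
        top_k = min(self.top_k, node_emb.size(1))
        idx = scores.topk(top_k, dim=-1).indices       # indices of most relevant nodes
        selected = torch.gather(
            node_emb, 1, idx.unsqueeze(-1).expand(-1, -1, node_emb.size(-1))
        )
        return selected, idx                           # pruned node set + indices
```

This is a minimal illustration of the general idea of pruning graph inputs before decoding, not the authors' implementation.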
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: multimodality, cross-modal information extraction, cross-modal application
Contribution Types: Model analysis & interpretability, Publicly available software and/or pre-trained models
Languages Studied: English
Submission Number: 3696