Abstract: Paragraph video captioning seeks to automatically describe multiple events in a video. Despite significant progress, most current approaches fail to fully leverage scene graph knowledge when performing cross-modal alignment between video and text representations. Consequently, such methods may not learn causal associations between entities, leading to degraded captioning performance. In this paper, we propose an end-to-end Vision-Language Scene Graphs Network (VLSG-net) to address this issue. We first introduce an encoder that integrates scene graph knowledge with global features and predicates to understand visual scenes. Specifically, the scene graph knowledge covers detected entities and models their correlations and constraints, enabling the representation of relationships among various entities. We then introduce a Knowledge-Enhanced Encoder paired with a contrastive loss to leverage scene graph knowledge, thereby enhancing multimodal structured representations. Finally, we propose a transformer-in-transformer decoder to model the coherence of intra- and inter-event relationships within the video and generate captions. By incorporating relationship reasoning among entities through scene graphs and video-language alignment learning, VLSG-net generates more logical and detailed captions. Extensive experiments confirm that VLSG-net performs favorably against state-of-the-art methods on two widely used benchmark datasets, ActivityNet Captions and YouCookII.
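As a rough illustration of the contrastive video-language alignment objective mentioned above, the sketch below shows a symmetric InfoNCE-style loss between pooled video and caption embeddings. The exact loss formulation, the embedding dimensions, the temperature value, and the function name are assumptions made for illustration, not the paper's specification.

```python
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(video_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE-style contrastive loss for video-text alignment.

    video_emb: (B, D) pooled video representations (e.g., from a
               knowledge-enhanced encoder) -- shapes are illustrative.
    text_emb:  (B, D) pooled caption representations.
    Matching video/text pairs share the same batch index.
    """
    # L2-normalize so the dot product equals cosine similarity.
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)

    # (B, B) similarity matrix; diagonal entries are the positive pairs.
    logits = v @ t.T / temperature
    targets = torch.arange(v.size(0), device=v.device)

    # Contrast in both directions: video-to-text and text-to-video.
    loss_v2t = F.cross_entropy(logits, targets)
    loss_t2v = F.cross_entropy(logits.T, targets)
    return 0.5 * (loss_v2t + loss_t2v)

# Example usage with random embeddings (batch of 8, 512-dim features).
if __name__ == "__main__":
    video = torch.randn(8, 512)
    text = torch.randn(8, 512)
    print(contrastive_alignment_loss(video, text).item())
```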