GTLR: Graph-Based Transformer with Language Reconstruction for Video Paragraph Grounding

2022 (modified: 16 Nov 2022) | ICME 2022 | Readers: Everyone
Abstract: Video Paragraph Grounding (VPG) aims to retrieve multiple relevant moments from an untrimmed video given a natural language paragraph query. However, the complexity of paragraph queries makes multimodal fusion and context modeling more challenging, which limits the performance of existing VPG methods. To this end, we propose a novel VPG framework, termed Graph-based Transformer with Language Reconstruction (GTLR). It consists of three components: (1) a Multimodal Graph Encoder that conducts graph reasoning for video-text fusion; (2) an Event-wise Decoder that predicts the timestamps from multiple sentence-level features; and (3) a Language Reconstructor that rebuilds the paragraph queries and makes our model explainable. We evaluate our model on two benchmarks, ActivityNet-Caption and Charades-STA, and conduct comprehensive experiments to analyze the effectiveness of each component. The experimental results show that GTLR outperforms recent state-of-the-art methods.