Diving Into The Relations: Leveraging Semantic and Visual Structures For Video Moment Retrieval

Ziyue Wu, Junyu Gao, Shucheng Huang, Changsheng Xu

2021 (modified: 21 Oct 2022) | ICME 2021 | Readers: Everyone

Abstract: The dominant approaches to video moment retrieval learn the semantic correlation between a given query and the video. However, these methods rarely explore the fine-grained semantic structure of the query or the comprehensive visual structure of the video, so textual and visual relations are underused. In this paper, we propose a unified framework for video moment retrieval that encodes semantic and visual structures simultaneously. Specifically, a semantic role tree is built to reveal fine-grained semantic information by generating hierarchical textual embeddings. The semantic structure is then used to facilitate visual structure learning through a contextual attention-based proposal interaction module. Finally, we adaptively aggregate the visual-semantic matching information through a multi-level fusion strategy and select the best-matching moment proposal. Extensive experiments on two popular benchmarks (Charades-STA and ActivityNet Captions) show that our proposed method achieves state-of-the-art performance. Code is available in the Supplementary Material.
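To make the final step more concrete, the following is a minimal sketch of one way matching scores from several semantic levels could be aggregated to pick a moment proposal. It is an illustration under stated assumptions, not the authors' implementation: the function names `softmax` and `fuse_and_select`, the three-level layout, the softmax weighting, and all numbers are hypothetical and do not come from the paper.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_and_select(level_scores, level_logits):
    """Weighted fusion of per-level matching scores (hypothetical scheme).

    level_scores[l][p]: matching score of proposal p at semantic level l.
    level_logits[l]:    unnormalized importance of level l (assumed to be
                        learned elsewhere; fixed here for illustration).
    Returns the index of the best proposal and the fused scores.
    """
    weights = softmax(level_logits)
    num_proposals = len(level_scores[0])
    fused = [
        sum(w * scores[p] for w, scores in zip(weights, level_scores))
        for p in range(num_proposals)
    ]
    best = max(range(num_proposals), key=fused.__getitem__)
    return best, fused


# Toy data: 3 semantic levels, 4 candidate moment proposals (made-up values).
level_scores = [
    [0.2, 0.7, 0.4, 0.1],
    [0.3, 0.6, 0.5, 0.2],
    [0.1, 0.9, 0.3, 0.3],
]
level_logits = [0.0, 0.0, 0.0]  # equal importance for every level

best, fused = fuse_and_select(level_scores, level_logits)
print("best proposal index:", best)
```

With equal logits the weights are uniform, so the fused score is simply the mean across levels; unequal logits would shift the selection toward the levels weighted more heavily.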
