Abstract: Video Moment Retrieval (MR) involves predicting the moment described by a given natural language or spoken language query in an untrimmed video. In this paper, we propose a novel Maskable Retentive Network (MRNet) to address two key challenges in MR: cross-modal guidance and video sequence modeling. Our approach introduces a new retention mechanism into the multimodal Transformer architecture, incorporating modality-specific attention modes. Specifically, we employ Unlimited Attention for language-related attention regions to maximize cross-modal mutual guidance, and Maskable Retention for the video-only attention region to enhance video sequence modeling. The latter recognizes two crucial characteristics of video sequences: 1) bidirectional, decaying, and non-linear temporal associations between video clips, and 2) sparse associations of key information semantically related to the query. Accordingly, we propose a bidirectional decay retention mask to explicitly model temporally distant context dependencies of video sequences, along with a learnable sparse retention mask to adaptively capture strong associations relevant to the target event. Extensive experiments on five popular MR benchmarks, ActivityNet Captions, TACoS, Charades-STA, ActivityNet Speech, and QVHighlights, demonstrate that our method achieves significant improvements over existing approaches. Code is available at https://github.com/xian-sh/MRNet.
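The following is a minimal sketch (not the authors' released code) of the two retention masks described in the abstract. It assumes the bidirectional decay mask follows a RetNet-style exponential decay gamma^|i-j| over clip distance generalized to both directions, and that the learnable sparse mask is a per-pair gate; the class and function names here are illustrative, not from the paper.

```python
import torch
import torch.nn as nn

def bidirectional_decay_mask(num_clips: int, gamma: float = 0.9) -> torch.Tensor:
    """D[i, j] = gamma ** |i - j|: associations decay with temporal distance in both directions."""
    idx = torch.arange(num_clips)
    dist = (idx[:, None] - idx[None, :]).abs().float()
    return gamma ** dist  # (T, T), symmetric, non-linear decay

class LearnableSparseMask(nn.Module):
    """Hypothetical learnable sparse retention mask: a gated score per clip pair."""
    def __init__(self, num_clips: int):
        super().__init__()
        self.logits = nn.Parameter(torch.zeros(num_clips, num_clips))

    def forward(self) -> torch.Tensor:
        # Sigmoid gate in [0, 1]; training would typically add a sparsity penalty
        # so that only query-relevant clip pairs remain strongly active.
        return torch.sigmoid(self.logits)

# Usage sketch: modulate raw video-only retention scores with both masks.
T, d = 8, 16
scores = torch.randn(T, T)                          # retention scores between video clips
values = torch.randn(T, d)                          # clip features
mask = bidirectional_decay_mask(T) * LearnableSparseMask(T)()
retained = (scores * mask) @ values                 # masked retention output
```

As a design note, multiplying the decay and sparse masks lets temporal-distance priors and query-relevant sparsity act jointly on the same retention scores; how the actual MRNet combines them is detailed in the paper, not here.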
Primary Subject Area: [Content] Vision and Language
Relevance To Conference: This paper introduces the Maskable Retentive Network (MRNet), a novel framework designed to tackle the challenges of Video Moment Retrieval (VMR), a quintessential multimodal task. By integrating a new retention mechanism within a multimodal Transformer framework, MRNet achieves a deeper fusion between natural or spoken language queries and visual content in untrimmed videos. Its modality-specific attention modes, Unlimited Attention for linguistic guidance and Maskable Retention for enhanced video sequence modeling, enable the model to effectively interpret the rich, intertwined information of multiple modalities. Our approach is bolstered by a bidirectional decay retention mask and a learnable sparse retention mask, which capture temporal nuances and filter out redundant information, thereby highlighting key event associations. Empirical results on various benchmarks demonstrate that MRNet substantially outperforms existing methods on Natural Language Moment Retrieval (NLMR), Spoken Language Moment Retrieval (SLMR), and the joint task of Moment Retrieval and Highlight Detection (MR+HD). This work not only sets a new benchmark in VMR but also contributes broadly to multimedia and multimodal processing by pushing the boundaries of cross-modal understanding and temporal sequence modeling. The advancements presented here hold promise for a range of applications, from interactive media retrieval to intelligent video analytics.
Supplementary Material: zip
Submission Number: 174