Dual-branch Complementary Multimodal Interaction Mechanism for Joint Video Moment Retrieval and Highlight Detection

Published: 01 Jan 2024 · Last Modified: 19 May 2025 · SMC 2024 · CC BY-SA 4.0
Abstract: Joint video moment retrieval and highlight detection is a video understanding task that requires a model to construct multimodal interactions between heterogeneous features. Recent Transformer-based models focus mainly on promoting global interaction between features, while local interaction and the modeling of temporal asynchronism remain underexplored. To address this, this paper proposes a dual-branch complementary multimodal interaction mechanism (DCMI), which consists of a global difference feature activation module (GDFA) and a local information dynamic aggregation module (LIDA). GDFA measures the difference between each target element and the global features, thereby activating important information. LIDA designs a multimodal heterogeneous graph and constructs asynchronous interaction between heterogeneous features to dynamically aggregate local information. DCMI adaptively fuses the two complementary branches, improving the model's ability to perceive and reason over both global and local information. Comprehensive comparisons with existing methods on public datasets verify the superiority of the proposed model, and extensive ablation experiments and qualitative analysis demonstrate the effectiveness and rationality of DCMI in promoting interaction between multimodal features.
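The dual-branch idea in the abstract can be illustrated with a minimal sketch. Note this is an assumption-laden simplification, not the paper's implementation: the sigmoid gating in `gdfa`, the sliding-window neighbourhood in `lida` (standing in for the paper's multimodal heterogeneous graph), and the fixed blend weight `alpha` (standing in for the adaptive fusion) are all illustrative choices.

```python
import numpy as np

def gdfa(features):
    """Global Difference Feature Activation (sketch): compare each
    element with the global (mean-pooled) feature and use the
    difference to gate it. The sigmoid gate is an assumption."""
    global_feat = features.mean(axis=0, keepdims=True)   # (1, D)
    diff = features - global_feat                        # (T, D)
    gate = 1.0 / (1.0 + np.exp(-diff))                   # sigmoid of the difference
    return features * gate

def lida(features, window=1):
    """Local Information Dynamic Aggregation (sketch): aggregate each
    element with its temporal neighbours. The paper builds a
    heterogeneous graph over modalities; a sliding window over one
    feature sequence is used here purely for illustration."""
    T, _ = features.shape
    out = np.zeros_like(features)
    for t in range(T):
        lo, hi = max(0, t - window), min(T, t + window + 1)
        out[t] = features[lo:hi].mean(axis=0)            # local average
    return out

def dcmi(features, alpha=0.5):
    """Fuse the complementary global and local branches. A fixed
    blend weight stands in for the adaptive fusion in the paper."""
    return alpha * gdfa(features) + (1.0 - alpha) * lida(features)
```

Under these assumptions, the output keeps the input's shape, so the fused features can feed the downstream moment-retrieval and highlight-detection heads unchanged.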