Learnable Negative Proposals Using Dual-Signed Cross-Entropy Loss for Weakly Supervised Video Moment Localization

Published: 20 Jul 2024, Last Modified: 21 Jul 2024, Venue: MM2024 Poster, License: CC BY 4.0
Abstract: Most existing methods for weakly supervised video moment localization rely on rule-based negative proposals. However, rule-based proposals are limited in their ability to capture the various confusing locations that occur throughout a video. To alleviate this limitation, we propose learning-based negative proposals that are trained with a dual-signed cross-entropy loss. This loss is controlled by a weight that changes gradually from a negative value to a positive one. The negative weight trains the negative proposals to capture query-irrelevant temporal boundaries (easy negatives) in the earlier training stages, whereas the positive weight trains them to capture somewhat query-relevant temporal boundaries (hard negatives) in the later training stages. To evaluate the quality of negative proposals, we introduce a new evaluation metric that measures how well a negative proposal captures a poorly generated positive proposal. We verify that our negative proposals can be applied with negligible additional parameters and inference cost, and that they achieve state-of-the-art performance on three public datasets.
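Illustrative sketch (not the authors' implementation): the abstract says only that the cross-entropy loss is scaled by a weight that moves gradually from a negative to a positive value. The Python below shows one way such a schedule could look. The linear schedule, the endpoints -1 and +1, and the names `signed_weight`, `cross_entropy`, and `dual_signed_ce` are all assumptions made for illustration, and `probs` stands in for whatever predicted distribution the cross-entropy is computed over, which the abstract does not specify.

```python
import math


def signed_weight(step, total_steps, w_start=-1.0, w_end=1.0):
    """Hypothetical linear schedule from w_start (negative) to w_end (positive)."""
    if total_steps <= 1:
        return w_end
    # Fraction of training completed, clamped to [0, 1].
    frac = min(max(step / (total_steps - 1), 0.0), 1.0)
    return w_start + frac * (w_end - w_start)


def cross_entropy(probs, target_index, eps=1e-12):
    """Standard cross-entropy for a single target: -log p(target)."""
    return -math.log(max(probs[target_index], eps))


def dual_signed_ce(probs, target_index, step, total_steps):
    """Cross-entropy scaled by the signed weight (assumed form of the loss).

    With a negative weight, minimizing this value increases the
    cross-entropy; with a positive weight, it decreases it.
    """
    return signed_weight(step, total_steps) * cross_entropy(probs, target_index)


if __name__ == "__main__":
    probs = [0.25, 0.75]
    for step in (0, 5, 10):
        print(step, signed_weight(step, 11), dual_signed_ce(probs, 1, step, 11))
```

In this sketch the sign flip is the whole mechanism: early in training the negated term rewards a high cross-entropy, and late in training the usual positive term rewards a low one. How this maps onto easy and hard negatives in the actual model depends on details that only the full paper provides.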
Primary Subject Area: [Content] Vision and Language
Secondary Subject Area: [Engagement] Multimedia Search and Recommendation, [Experience] Multimedia Applications, [Content] Multimodal Fusion
Relevance To Conference: This paper presents a novel method for video grounding, which aims to find relevant video segments by understanding multimodal information from videos and natural language. The method can be used in various multimedia applications such as video understanding, video summarization, video search, and video question answering. Moreover, because the proposed method is weakly supervised, it requires only video-sentence pairs for training and no manual annotations of temporal locations. This makes it much easier to collect a large amount of multimedia training data, since video-sentence pairs can be obtained from metadata on the Internet or through automatic speech recognition (ASR). As a result, the proposed method is applicable to very large-scale multimodal learning.
Supplementary Material: zip
Submission Number: 3326