Abstract: Video Moment and Highlight Retrieval (VMHR) aims to retrieve the video moments described by a text query from a long untrimmed video and to select the most related highlights by assigning worthiness scores. However, we observe that existing methods mostly suffer from two defects: 1) Temporal annotations of highlight scores are labor-intensive and subjective, so gathering high-quality annotated training data is difficult and expensive. 2) Previous VMHR methods tend to fit temporal distributions instead of learning vision-language relevance, which exposes the limited robustness of the conventional paradigm against biased training data from open-world scenarios. In this paper, we propose a novel method termed Query as Supervision (QaS), which jointly tackles annotation cost and model robustness in the VMHR task. Specifically, instead of learning from the distributions of temporal annotations, QaS learns multimodal alignments entirely within the semantic space via our proposed Hybrid Ranking Learning scheme for retrieving moments and highlights. In this way, it requires only low-cost annotations and provides much better robustness towards out-of-distribution test samples. We evaluate QaS on three benchmark datasets, i.e., QVHighlights, BLiSS, and Charades-STA, as well as their biased training versions. Extensive experiments demonstrate that QaS outperforms existing state-of-the-art methods under the same low-cost annotation settings and shows better robustness against biased training data. Our code is available at https://github.com/CFM-MSG/Code_QaS.
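To make the idea of query-supervised ranking concrete, the snippet below is a minimal, generic sketch of a vision-language ranking objective of the kind the abstract alludes to: clips relevant to the text query are pushed to score higher than irrelevant clips in a shared semantic space, without highlight-score annotations. This is an illustration only, not the authors' Hybrid Ranking Learning scheme; the function name, margin value, and feature shapes are assumptions.

import torch
import torch.nn.functional as F

def query_clip_ranking_loss(query_emb, clip_embs, pos_mask, margin=0.2):
    """Illustrative hinge ranking loss between a text query and video clips.

    query_emb: (D,) text-query embedding
    clip_embs: (N, D) per-clip visual embeddings
    pos_mask:  (N,) boolean mask of clips relevant to the query; this coarse,
               low-cost signal is the only supervision used here.
    """
    # Cosine similarity between the query and every clip (shape: (N,)).
    sims = F.cosine_similarity(query_emb.unsqueeze(0), clip_embs, dim=-1)
    pos_sims = sims[pos_mask]    # clips that should rank highly
    neg_sims = sims[~pos_mask]   # clips that should rank lower
    # Every positive clip should beat every negative clip by at least `margin`.
    diff = margin - pos_sims.unsqueeze(1) + neg_sims.unsqueeze(0)  # (P, Q)
    return diff.clamp(min=0).mean()

# Usage with random features: a 256-d query, 10 clips, 3 of them relevant.
query = torch.randn(256)
clips = torch.randn(10, 256)
mask = torch.zeros(10, dtype=torch.bool)
mask[3:6] = True
loss = query_clip_ranking_loss(query, clips, mask)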