Time-Frequency Mutual Learning for Moment Retrieval and Highlight Detection

Published: 01 Jan 2024, Last Modified: 03 May 2025, PRCV (5) 2024, CC BY-SA 4.0
Abstract: Moment Retrieval and Highlight Detection (MR/HD) aim to concurrently retrieve relevant moments and predict clip-wise saliency scores according to a given textual query. Previous MR/HD works have overlooked explicit modeling of the static-dynamic visual information described by the language query, which can lead to inaccurate predictions, especially when the queried event involves both static appearances and dynamic motions. In this work, we learn static interaction and dynamic reasoning from the time domain and the frequency domain respectively, and propose a novel Time-Frequency Mutual Learning framework (TFML) which consists of a time-domain branch, a frequency-domain branch, and a time-frequency aggregation branch. The time-domain branch learns to attend to the static visual information related to the textual query. In the frequency-domain branch, we introduce the Short-Time Fourier Transform (STFT) for dynamic modeling by attending to the frequency contents within varied segments. The time-frequency aggregation branch integrates the information from these two branches. To promote the mutual complementation of time-domain and frequency-domain information, we further employ a mutual learning strategy in a concise and effective two-way loop, which enables the branches to collaboratively reason and achieve time-frequency consistent predictions. Extensive experiments on QVHighlights and TVSum demonstrate the effectiveness of our proposed framework compared with state-of-the-art methods.
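To illustrate the frequency-domain idea, the sketch below applies an STFT over a clip feature sequence, producing one magnitude spectrum per temporal window and feature dimension. This is a minimal NumPy illustration only: the function name, window length, and hop size are hypothetical choices, not the paper's actual TFML implementation.

```python
import numpy as np

def clip_stft_features(features, win=8, hop=4):
    """Short-Time Fourier Transform over a clip feature sequence.

    features: (T, D) array of clip-level embeddings.
    Returns an array of shape (num_windows, win // 2 + 1, D):
    a magnitude spectrum per window and per feature dimension,
    capturing how each dimension varies (dynamics) within segments.
    """
    T, D = features.shape
    window = np.hanning(win)  # taper to reduce spectral leakage
    frames = []
    for start in range(0, T - win + 1, hop):
        seg = features[start:start + win] * window[:, None]
        # rfft along the time axis: low bins ~ slow/static content,
        # high bins ~ fast/dynamic content within the segment
        frames.append(np.abs(np.fft.rfft(seg, axis=0)))
    return np.stack(frames)

feats = np.random.default_rng(0).normal(size=(32, 16))
spec = clip_stft_features(feats)
print(spec.shape)  # (7, 5, 16): 7 windows, 5 frequency bins, 16 dims
```

In the framework, such per-segment spectra would be attended to by the query, complementing the time-domain branch's attention over raw clip features.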