Regularized Two Granularity Loss Function for Weakly Supervised Video Moment Retrieval

Junya Teng, Xiankai Lu, Yongshun Gong, Xinfang Liu, Xiushan Nie, Yilong Yin

2022 (modified: 02 Nov 2022), IEEE Trans. Multim. 2022

Abstract: Weakly supervised video moment retrieval, also called weakly supervised language moment retrieval, aims to retrieve the video moment most relevant to a given language query. To guide the model toward the video segments that best match the text description, we design a two-granularity loss function that simultaneously considers both video-level and instance-level relationships. Specifically, we first generate coarse video segments and regard each video segment as an instance. For the video-level regularized multiple instance loss (MIL), we leverage the latent alignment between all intra-video segments (i.e., the positive bag) and the text descriptions. Then, we classify these segments by regarding this procedure as a supervised learning task under noisy labels. With the instance-level regularized loss function, our model can learn to correct noisy instance-level labels so as to locate more accurate frame boundaries from all the positive instances. Comprehensive experimental results on ActivityNet and DiDeMo demonstrate that the proposed loss function sets a new state of the art.
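
The abstract does not give the loss formulas, so the following is only a minimal sketch of the general two-granularity idea, not the paper's regularized losses. It assumes each video has already been split into segments with a query-matching score per segment; the function names, the max-pooling bag score, the hinge margin, and the fixed threshold are all illustrative assumptions.

```python
from typing import List


def bag_score(segment_scores: List[float]) -> float:
    """Video-level (bag) score: assume the best-matching segment represents the video."""
    return max(segment_scores)


def mil_ranking_loss(pos_bag: List[float], neg_bag: List[float], margin: float = 0.5) -> float:
    """Generic hinge-style MIL loss (an assumption, not the paper's loss).

    pos_bag holds segment scores for the video paired with the query,
    neg_bag holds segment scores for a mismatched video. The loss is zero
    once the matched video outscores the mismatched one by `margin`.
    """
    return max(0.0, margin - bag_score(pos_bag) + bag_score(neg_bag))


def pseudo_labels(segment_scores: List[float], threshold: float = 0.5) -> List[int]:
    """Noisy instance-level labels: segments at or above the threshold count as positive.

    A real system would refine these labels during training; here they are
    produced once by a fixed, hypothetical threshold.
    """
    return [1 if score >= threshold else 0 for score in segment_scores]


if __name__ == "__main__":
    matched = [0.2, 0.9, 0.4]      # segment scores for the paired video
    mismatched = [0.3, 0.1, 0.6]   # segment scores for an unrelated video
    print("video-level loss:", mil_ranking_loss(matched, mismatched))
    print("instance-level pseudo-labels:", pseudo_labels(matched))
```

The video-level term only needs video-query pairs, which is what makes the setting weakly supervised, while the instance-level term treats per-segment labels as noisy targets to be corrected.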
