Video Retrieval with Tree-Based Video Segmentation

Published: 01 Jan 2023, Last Modified: 06 Aug 2024. DASFAA (3) 2023. License: CC BY-SA 4.0
Abstract: Text-to-video retrieval aims to find relevant videos from text queries. The recently introduced Contrastive Language-Image Pretraining (CLIP), a vision-language model pretrained on large-scale image-caption pairs, has been extensively used in the literature. Existing studies have focused on directly applying CLIP to learn temporal dependencies. While leveraging video dynamics sounds intuitively reasonable, learning temporal dynamics has demonstrated no advantage or only small improvements. When temporal dynamics are not incorporated, most studies focus on constructing representative images from a video. However, we found that these images tend to be noisy, degrading performance on the text-to-video retrieval task. Motivated by this observation, we introduce a novel tree-based frame division method that focuses learning on the most relevant images.
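The abstract does not specify the details of the tree-based frame division, so the following is only a minimal, hypothetical sketch of one way such a method could work: recursively bisect a video's sampled frames into a binary tree of segments, score each segment against the text query with off-the-shelf CLIP similarities, and descend into the better-matching half until a single most query-relevant frame remains. All function names and the descent rule here are illustrative assumptions, not the paper's actual algorithm.

```python
# Hypothetical sketch of tree-based frame division with CLIP similarities.
# NOT the paper's algorithm; the descent rule below is an assumption.

import torch
import clip  # OpenAI CLIP: pip install git+https://github.com/openai/CLIP.git

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

@torch.no_grad()
def embed_text(query: str) -> torch.Tensor:
    # Encode and L2-normalize the text query.
    tokens = clip.tokenize([query]).to(device)
    t = model.encode_text(tokens)
    return t / t.norm(dim=-1, keepdim=True)

@torch.no_grad()
def embed_frames(frames: list) -> torch.Tensor:
    # `frames` is a list of PIL images sampled from the video, in time order.
    batch = torch.stack([preprocess(f) for f in frames]).to(device)
    v = model.encode_image(batch)
    return v / v.norm(dim=-1, keepdim=True)

def select_frame(frame_emb: torch.Tensor, text_emb: torch.Tensor) -> int:
    """Binary-tree descent: split the current frame span in half and
    recurse into the half with the higher mean text similarity, until a
    single frame (the candidate representative image) remains."""
    sims = (frame_emb @ text_emb.T).squeeze(-1)  # cosine sim per frame
    lo, hi = 0, frame_emb.shape[0]               # current span [lo, hi)
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if sims[lo:mid].mean() >= sims[mid:hi].mean():
            hi = mid                             # descend into left child
        else:
            lo = mid                             # descend into right child
    return lo                                    # index of selected frame
```

Under these assumptions, the tree structure lets the selection ignore entire noisy segments at once rather than scoring frames in isolation, which matches the abstract's motivation of filtering out noisy representative images.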