Exploiting Unlabeled Videos for Video-Text Retrieval via Pseudo-Supervised Learning
Abstract: Large-scale pre-trained vision-language models (\eg, CLIP) have shown remarkable generalization performance on downstream tasks such as video-text retrieval~(VTR).
Traditional approaches have leveraged CLIP's robust multi-modal alignment ability for VTR by directly fine-tuning vision and text encoders with clean video-text data.
Yet, these techniques rely on carefully annotated video-text pairs, which are expensive and require significant manual effort.
In this context, we introduce a new approach, \textbf{P}seudo-\textbf{S}upervised \textbf{S}elective \textbf{C}ontrastive \textbf{L}earning (\textbf{PS-SCL}).
PS-SCL minimizes the dependency on manually-labeled text annotations by generating pseudo-supervisions from unlabeled video data for training.
We first exploit CLIP's visual recognition capabilities to generate pseudo-texts automatically. These pseudo-texts contain diverse visual concepts from the video and serve as weak textual guidance.
Moreover, we introduce Selective Contrastive Learning (SeLeCT), which prioritizes and selects highly correlated pairs from the pseudo-supervised video-text pairs. By doing so, SeLeCT enables more effective multi-modal learning under weak pairing supervision.
Experimental results demonstrate that our method outperforms zero-shot CLIP by a large margin on multiple video-text retrieval benchmarks, \eg, 8.2\% video-to-text R@1 on MSRVTT, 12.2\% on DiDeMo, and 10.9\% on ActivityNet.
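To make the idea of selective contrastive learning over pseudo pairs concrete, the following is a minimal sketch, not the paper's actual formulation. It assumes row-aligned video and pseudo-text embeddings (in practice produced by CLIP's video and text encoders), and the function name `selective_contrastive_loss`, the `keep_ratio` parameter, and the diagonal-similarity selection rule are illustrative assumptions.

```python
# Minimal sketch: select the most reliable pseudo video-text pairs and apply a
# symmetric InfoNCE loss over only those pairs. Selection rule and names are
# assumptions for illustration, not the method described in the paper.
import torch
import torch.nn.functional as F


def selective_contrastive_loss(video_feats, text_feats, keep_ratio=0.5, temperature=0.07):
    """video_feats, text_feats: (N, D) embeddings where row i of each tensor
    is assumed to form one pseudo video-text pair."""
    v = F.normalize(video_feats, dim=-1)
    t = F.normalize(text_feats, dim=-1)

    # Pairwise cosine similarity between videos and pseudo-texts.
    sim = v @ t.T  # (N, N)

    # Score each pseudo pair by its diagonal similarity and keep only the top
    # fraction -- one simple way to "prioritize highly correlated pairs".
    pair_scores = sim.diag()
    n_keep = max(1, int(keep_ratio * v.size(0)))
    keep_idx = pair_scores.topk(n_keep).indices

    logits = sim[keep_idx][:, keep_idx] / temperature
    targets = torch.arange(n_keep, device=logits.device)

    # Symmetric video-to-text and text-to-video contrastive terms.
    loss_v2t = F.cross_entropy(logits, targets)
    loss_t2v = F.cross_entropy(logits.T, targets)
    return 0.5 * (loss_v2t + loss_t2v)


if __name__ == "__main__":
    # Stand-in random features; in practice these would be CLIP embeddings of
    # unlabeled videos and their automatically generated pseudo-texts.
    videos = torch.randn(8, 512)
    pseudo_texts = torch.randn(8, 512)
    print(selective_contrastive_loss(videos, pseudo_texts).item())
```

The only design point the sketch tries to convey is that the contrastive objective is computed over a selected subset of pseudo pairs rather than all of them, which limits the influence of noisy pseudo-texts.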