Exploiting Unlabeled Videos for Video-Text Retrieval via Pseudo-Supervised Learning
Abstract: Large-scale pre-trained vision-language models (\eg, CLIP) have shown remarkable generalization performance on downstream tasks such as video-text retrieval~(VTR).
Traditional approaches have leveraged CLIP's robust multi-modal alignment ability for VTR by directly fine-tuning vision and text encoders with clean video-text data.
Yet, these techniques rely on carefully annotated video-text pairs, which are expensive and require significant manual effort.
In this context, we introduce a new approach, \textbf{P}seudo-\textbf{S}upervised \textbf{S}elective \textbf{C}ontrastive \textbf{L}earning (\textbf{PS-SCL}).
PS-SCL minimizes the dependency on manually-labeled text annotations by generating pseudo-supervisions from unlabeled video data for training.
We first exploit CLIP's visual recognition capabilities to generate pseudo-texts automatically. These pseudo-texts contain diverse visual concepts from the video and serve as weak textual guidance.
Moreover, we introduce Selective Contrastive Learning (SeLeCT), which prioritizes and selects highly correlated pairs from the pseudo-supervised video-text pairs. By doing so, SeLeCT enables more effective multi-modal learning under weak pairing supervision.
Experimental results demonstrate that our method outperforms zero-shot CLIP by a large margin on multiple video-text retrieval benchmarks, \eg, 8.2\% video-to-text R@1 on MSRVTT, 12.2\% on DiDeMo, and 10.9\% on ActivityNet.
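To make the idea of selective contrastive learning over pseudo pairs concrete, the following is a minimal sketch, not the paper's actual formulation. It assumes row-aligned video and pseudo-text embeddings (in practice produced by CLIP's video and text encoders), and the function name `selective_contrastive_loss`, the `keep_ratio` parameter, and the diagonal-similarity selection rule are illustrative assumptions.

```python
# Minimal sketch: select the most reliable pseudo video-text pairs and apply a
# symmetric InfoNCE loss over only those pairs. Selection rule and names are
# assumptions for illustration, not the method described in the paper.
import torch
import torch.nn.functional as F


def selective_contrastive_loss(video_feats, text_feats, keep_ratio=0.5, temperature=0.07):
    """video_feats, text_feats: (N, D) embeddings where row i of each tensor
    is assumed to form one pseudo video-text pair."""
    v = F.normalize(video_feats, dim=-1)
    t = F.normalize(text_feats, dim=-1)

    # Pairwise cosine similarity between videos and pseudo-texts.
    sim = v @ t.T  # (N, N)

    # Score each pseudo pair by its diagonal similarity and keep only the top
    # fraction -- one simple way to "prioritize highly correlated pairs".
    pair_scores = sim.diag()
    n_keep = max(1, int(keep_ratio * v.size(0)))
    keep_idx = pair_scores.topk(n_keep).indices

    logits = sim[keep_idx][:, keep_idx] / temperature
    targets = torch.arange(n_keep, device=logits.device)

    # Symmetric video-to-text and text-to-video contrastive terms.
    loss_v2t = F.cross_entropy(logits, targets)
    loss_t2v = F.cross_entropy(logits.T, targets)
    return 0.5 * (loss_v2t + loss_t2v)


if __name__ == "__main__":
    # Stand-in random features; in practice these would be CLIP embeddings of
    # unlabeled videos and their automatically generated pseudo-texts.
    videos = torch.randn(8, 512)
    pseudo_texts = torch.randn(8, 512)
    print(selective_contrastive_loss(videos, pseudo_texts).item())
```

The only design point the sketch tries to convey is that the contrastive objective is computed over a selected subset of pseudo pairs rather than all of them, which limits the influence of noisy pseudo-texts.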