MPT: Multi-grained Prompt Tuning for Text-Video Retrieval

Published: 20 Jul 2024, Last Modified: 06 Aug 2024, MM 2024 Poster, CC BY 4.0
Abstract: Recently, significant advancements have been made in supporting text-video retrieval by transferring large-scale image-text pre-training models through model adaptation, i.e., full fine-tuning, or prompt tuning, a parameter-efficient fine-tuning strategy. While full fine-tuning incurs high computational costs, particularly as model size increases, prompt tuning offers greater flexibility and efficiency by adjusting only a small number of learnable parameters. However, current prompt tuning methods rely on coarse visual and textual cues for the text-video retrieval task, neglecting domain-specific features during adaptation. This may lead to sub-optimal performance due to the incorporation of irrelevant and indiscriminate knowledge. To address this issue, we present Multi-grained Prompt Tuning (MPT) for text-video retrieval, which designs a variety of specific prompts to effectively explore semantic interaction across modalities at diverse granularities. Specifically, we devise a multi-grained video encoder that employs spatial, temporal, and global prompts to transfer base-generic knowledge from the image-text pre-trained model while comprehensively excavating determinative video-specific characteristics. Meanwhile, we introduce a novel multi-grained text encoder that captures various levels of textual clues through word and phrase prompts. Extensive experiments on four benchmark datasets, i.e., MSR-VTT, ActivityNet, DiDeMo, and LSMDC, demonstrate that MPT achieves outstanding performance, surpassing state-of-the-art methods with negligible computational cost. The codebase is publicly available at: https://github.com/zchoi/MPT.
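As a rough illustration of the prompt tuning idea described in the abstract, the following PyTorch sketch prepends learnable spatial, temporal, and global prompt tokens to a frozen encoder so that only the prompts are updated during training. The class name, prompt counts, dimensions, and the stand-in transformer backbone are illustrative assumptions, not the authors' implementation; refer to the linked repository for the actual model.

```python
import torch
import torch.nn as nn


class MultiGrainedPromptedVideoEncoder(nn.Module):
    """Minimal sketch: learnable spatial, temporal, and global prompt tokens are
    prepended to frame/patch tokens from a frozen backbone. In MPT the backbone
    is a pre-trained image-text model (e.g., CLIP); here a small transformer
    stands in for it, and all backbone weights are frozen."""

    def __init__(self, dim=512, n_spatial=4, n_temporal=4, n_global=1, depth=2):
        super().__init__()
        # Only the prompt tokens are trainable (parameter-efficient tuning).
        self.spatial_prompts = nn.Parameter(torch.randn(n_spatial, dim) * 0.02)
        self.temporal_prompts = nn.Parameter(torch.randn(n_temporal, dim) * 0.02)
        self.global_prompt = nn.Parameter(torch.randn(n_global, dim) * 0.02)
        # Frozen stand-in for the pre-trained encoder.
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=depth)
        for p in self.backbone.parameters():
            p.requires_grad = False

    def forward(self, frame_tokens):
        # frame_tokens: (batch, num_frames * num_patches, dim) from the frozen visual encoder.
        b = frame_tokens.size(0)
        prompts = torch.cat(
            [self.global_prompt, self.temporal_prompts, self.spatial_prompts], dim=0
        ).unsqueeze(0).expand(b, -1, -1)
        x = torch.cat([prompts, frame_tokens], dim=1)
        x = self.backbone(x)
        # Read out the global-prompt position as the video-level representation.
        return x[:, 0]


if __name__ == "__main__":
    enc = MultiGrainedPromptedVideoEncoder()
    video_feats = torch.randn(2, 12 * 49, 512)  # toy example: 12 frames x 49 patches
    print(enc(video_feats).shape)  # torch.Size([2, 512])
```

The multi-grained text encoder follows the same pattern, with word- and phrase-level prompt tokens prepended to the frozen text backbone instead.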
Primary Subject Area: [Content] Vision and Language
Secondary Subject Area: [Content] Multimodal Fusion
Relevance To Conference: This paper presents a multi-grained prompt tuning approach for text-video retrieval, which employs various prompts to capture domain-specific features from both the text and video modalities during pre-trained model adaptation. Multimodal comprehension and analysis of video and text typically require large numbers of high-quality samples and substantial computational resources; previous full fine-tuning methods are therefore difficult to apply in practical multimedia retrieval systems and are prone to losing prior knowledge. To tackle this, we propose a prompt tuning-based approach to cross-modal retrieval that fully considers domain-specific semantic information by incorporating multiple types of prompts tailored to the characteristics of video and text. The proposed method achieves the best performance-computation trade-off on the text-video retrieval task. It is practical for video-based applications such as video search, video summarization, and open-set retrieval systems, contributing to advancing the state of the art in multimedia and multimodal processing.
Submission Number: 579