Alignment and Generation Adapter for Efficient Video-text Understanding

Published: 01 Jan 2023 · Last Modified: 13 Nov 2024 · ICCV (Workshops) 2023 · CC BY-SA 4.0
Abstract: Pre-trained models have shown strong performance, especially in cross-modal understanding between video and text. However, fine-tuning them at scale is costly and complicates adaptation to diverse downstream tasks. To address these challenges, we propose the Alignment-Generation Adapter (AGAdapter), which establishes semantic coherence between alignment and generation models for efficient video-text adaptation across multiple tasks simultaneously. We propose an alignment adapter with knowledge sharing that adapts the frozen CLIP model for fine-grained video-language interaction, and a generation adapter with prompt tuning that leverages a large language model for captioning. Furthermore, we introduce instruction joint tuning, which combines textual and cross-modal instructions to capture detailed descriptions. AGAdapter achieves state-of-the-art performance on video-text retrieval and video captioning on two benchmarks, MSR-VTT and ActivityNet.
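
The abstract does not provide implementation details; the PyTorch sketch below only illustrates the general pattern it describes: lightweight trainable adapters applied to features from a frozen CLIP model for alignment, plus learned soft prompts that condition a frozen language model for caption generation. All class names, dimensions, and the particular knowledge-sharing scheme (a single adapter shared across modalities) are assumptions made for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class BottleneckAdapter(nn.Module):
    """Lightweight residual adapter trained on top of frozen backbone features."""

    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Residual connection keeps the frozen features intact at initialization.
        return x + self.up(self.act(self.down(x)))


class AGAdapterSketch(nn.Module):
    """Illustrative trainable head over frozen CLIP features.

    Assumes video/text features are pre-extracted by a frozen CLIP model
    upstream; only the adapter, projection, and soft prompts are trainable.
    """

    def __init__(self, clip_dim: int = 512, lm_dim: int = 768, num_prompts: int = 16):
        super().__init__()
        # One adapter shared by both modalities -- one possible reading of
        # "knowledge sharing"; the paper's actual sharing scheme may differ.
        self.shared_adapter = BottleneckAdapter(clip_dim)
        # Learned soft prompts prepended to the language model input (prompt tuning).
        self.soft_prompts = nn.Parameter(torch.randn(num_prompts, lm_dim) * 0.02)
        # Projects adapted visual features into the language model's embedding space.
        self.to_lm = nn.Linear(clip_dim, lm_dim)

    def align(self, video_feat: torch.Tensor, text_feat: torch.Tensor) -> torch.Tensor:
        """Alignment branch: adapted, normalized features and their similarity matrix."""
        v = F.normalize(self.shared_adapter(video_feat), dim=-1)
        t = F.normalize(self.shared_adapter(text_feat), dim=-1)
        return v @ t.t()  # [num_videos, num_texts], used for retrieval / contrastive loss

    def generation_prefix(self, video_feat: torch.Tensor) -> torch.Tensor:
        """Generation branch: build prefix embeddings (prompts + projected video feature)
        to be fed to a frozen language model for captioning."""
        batch = video_feat.size(0)
        prompts = self.soft_prompts.unsqueeze(0).expand(batch, -1, -1)
        vis = self.to_lm(self.shared_adapter(video_feat)).unsqueeze(1)
        return torch.cat([prompts, vis], dim=1)  # [batch, num_prompts + 1, lm_dim]


# Toy usage with random stand-ins for frozen CLIP features.
model = AGAdapterSketch()
sim = model.align(torch.randn(4, 512), torch.randn(4, 512))       # retrieval scores
prefix = model.generation_prefix(torch.randn(4, 512))             # captioning prefix
```

In this reading, only the adapter, the projection layer, and the soft prompts receive gradients, which is what makes adapter-style tuning cheaper than full fine-tuning of CLIP and the language model.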