MoS$^2$: Mixture of Scale and Shift Experts for Text-Only Video Captioning

Published: 20 Jul 2024, Last Modified: 21 Jul 2024 · MM 2024 Poster · CC BY 4.0
Abstract: Video captioning is a challenging task that typically requires paired video-text data for training. However, manually annotating coherent textual descriptions for videos is laborious and time-consuming. To address this problem, we propose to enhance video captioning models using text data alone. Drawing inspiration from the strong text generation capabilities of large language models (LLMs), we leverage them to generate high-quality and diverse video captions for the target domain. Specifically, we prompt GPT-4 with a few target-domain captions to generate a small set of plausible video captions, and then continue prompting GPT-4 with the generated captions to obtain captions at scale. To fully exploit the generated captions, we propose a Mixture of Scale and Shift experts (MoS$^2$) for efficient adaptation of pre-trained image captioning models to video captioning. MoS$^2$ uses a lightweight routing network to estimate a probability distribution over a collection of experts, which determines how tokens are allocated to the appropriate experts. This dynamic routing tailors the model's response to input features, improving its ability to handle data variations. Our approach not only customizes model responses to input variations, effectively addressing the distribution shift between synthetic and real captions, but also significantly reduces the number of learnable parameters, allowing for more efficient adaptation. With only text data, we achieve superior performance and significantly narrow the performance gap between zero-shot and fine-tuned models. By boosting video captioning performance with synthetic text data, our method substantially alleviates the dependence on large-scale paired data from the target domain. The code will be publicly available.
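To make the two-stage caption-generation procedure concrete, here is a minimal sketch of the bootstrapping loop: seed GPT-4 with a few target-domain captions, then repeatedly prompt it with its own outputs to grow the caption pool. The prompt wording, batch sizes, and the `query_llm` callable are illustrative placeholders, not the authors' exact setup.

```python
# Hypothetical sketch of the few-shot-then-bootstrap caption generation;
# `query_llm` is any callable that sends a prompt to GPT-4 and returns a
# list of caption strings (placeholder, not a real API signature).
from typing import Callable, List

def bootstrap_captions(seed_captions: List[str],
                       query_llm: Callable[[str], List[str]],
                       rounds: int = 3,
                       per_round: int = 100) -> List[str]:
    pool = list(seed_captions)
    for _ in range(rounds):
        examples = "\n".join(pool[-10:])  # condition on recently generated captions
        prompt = (
            "Here are example video captions from the target domain:\n"
            f"{examples}\n"
            f"Write {per_round} new, diverse captions in the same style."
        )
        pool.extend(query_llm(prompt))
    return pool
```

The MoS$^2$ adapter itself can be pictured as a bank of per-channel scale-and-shift experts mixed by a token-wise softmax router, as described in the abstract. The sketch below assumes SSF-style experts and a single linear router; tensor shapes and module names are assumptions for illustration, not the released implementation.

```python
# A minimal sketch of one MoS^2 layer, assuming per-channel scale-and-shift
# experts combined by a softmax router over tokens; shapes are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoS2Layer(nn.Module):
    def __init__(self, dim: int, num_experts: int = 4):
        super().__init__()
        # Each expert is a learnable per-channel scale and shift (SSF-style).
        self.scales = nn.Parameter(torch.ones(num_experts, dim))
        self.shifts = nn.Parameter(torch.zeros(num_experts, dim))
        # Lightweight router: a single linear layer producing expert logits per token.
        self.router = nn.Linear(dim, num_experts)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, dim) features from a frozen pre-trained block.
        weights = F.softmax(self.router(x), dim=-1)               # (B, T, E)
        expert_out = x.unsqueeze(-2) * self.scales + self.shifts  # (B, T, E, D)
        return torch.einsum("bte,bted->btd", weights, expert_out)
```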
Primary Subject Area: [Content] Vision and Language
Relevance To Conference: This work introduces a novel text-only training pipeline that significantly enhances domain-specific video captioning by leveraging text data alone. We conduct text-only training for video captioning for the first time and validate its effectiveness. Our approach utilizes the generative capabilities of GPT-4 to produce high-quality, domain-specific captions, effectively reducing reliance on paired data and the associated annotation cost. We develop a novel Mixture of Scale and Shift experts (MoS$^2$) architecture, which addresses the distribution shift between synthetic and real captions, enhancing model flexibility and video captioning performance. Notably, our approach achieves near full-text performance with minimal real training examples (10-shot) and requires updating only 0.4% of parameters to surpass existing parameter-efficient learning methods. Using only text data, our approach not only demonstrates superior performance but also substantially narrows the gap between zero-shot and fine-tuned models. By enhancing video captioning with synthetic textual data, our method greatly reduces reliance on large-scale, domain-specific paired datasets.
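As a rough illustration of the parameter-efficient recipe claimed above (updating only ~0.4% of parameters), one would freeze the pre-trained captioning backbone and leave only the inserted MoS$^2$ experts and routers trainable. The module-name filter below is an assumed convention for the sketch, not the authors' code.

```python
# Assumed setup: freeze the backbone, train only the inserted MoS^2 modules.
import torch.nn as nn

def prepare_for_text_only_tuning(model: nn.Module) -> float:
    for name, param in model.named_parameters():
        # "mos2" marks the inserted scale/shift experts and routers
        # (naming convention assumed for illustration).
        param.requires_grad = "mos2" in name
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    return trainable / total  # fraction of learnable parameters
```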
Supplementary Material: zip
Submission Number: 1503