Show, Think, and Tell: Thought-Augmented Fine-Tuning of Large Language Models for Video Captioning

Published: 01 Jan 2024, Last Modified: 11 Nov 2024 · CVPR Workshops 2024 · CC BY-SA 4.0
Abstract: Large language models (LLMs) have achieved great success in natural language processing and hold significant potential for multi-modal applications. Despite their surprising zero-shot and few-shot abilities, pre-trained language models still need to be fine-tuned effectively for specific downstream tasks. In this paper, we introduce CaptionT5, a video captioning model that fine-tunes T5 to understand videos and generate descriptive captions. To generate captions that correspond more closely to the video, CaptionT5 introduces thought-augmented fine-tuning for video captioning, in which a pre-trained language model is fine-tuned on thought-augmented video inputs. This resembles the process in which humans see a video, think of visual concepts such as objects and actions, and then tell a correct and natural sentence based on those thoughts. To generate thoughts automatically, we propose (1) CLIP-guided thought sampling, which samples thoughts based on similarity in an image-text multimodal embedding space by leveraging CLIP. We also propose (2) CLIP-guided caption ranking during decoding for further performance gains. Through experiments on the VATEX, MSRVTT, and YC2 datasets, we empirically demonstrate that CaptionT5 performs competitively against prior video captioning approaches without using encoders specialized for video data. Further experiments show that CaptionT5 is especially effective when only a small number of video frames is sampled.
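
The abstract describes two CLIP-guided components: sampling visual "thoughts" to augment the T5 input and ranking decoded captions against the video frames. The following is a minimal sketch of how such CLIP-based scoring could look, not the authors' implementation; the model checkpoint, the candidate concept vocabulary, the dummy frames, and the mean-over-frames aggregation are all assumptions for illustration.

# Minimal sketch (not the authors' code) of CLIP-guided thought sampling
# and caption ranking as described in the abstract.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Placeholder frames; in practice these would be frames sampled from the video.
frames = [Image.new("RGB", (224, 224), c) for c in ("red", "green", "blue")]

def clip_scores(texts, images):
    # Return a (num_texts,) tensor of image-text similarities averaged over frames.
    inputs = processor(text=texts, images=images, return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    # logits_per_text has shape (num_texts, num_images).
    return out.logits_per_text.mean(dim=1)

# (1) CLIP-guided thought sampling: score candidate visual concepts against the
# frames and keep the top-k as "thoughts" to prepend to the T5 input.
candidate_concepts = ["a person cooking", "a dog running", "chopping vegetables", "a soccer match"]
thought_scores = clip_scores(candidate_concepts, frames)
thoughts = [candidate_concepts[i] for i in thought_scores.topk(2).indices]
print("sampled thoughts:", thoughts)

# (2) CLIP-guided caption ranking: among captions decoded by the fine-tuned T5
# (e.g., beam-search candidates), keep the one most similar to the frames.
candidate_captions = [
    "a chef chops vegetables on a cutting board",
    "two teams play soccer in a stadium",
]
caption_scores = clip_scores(candidate_captions, frames)
print("best caption:", candidate_captions[caption_scores.argmax()])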
