CLIP-based Semantic Enhancement and Vocabulary Expansion for Video Captioning Using Reinforcement Learning
Abstract: Video captioning aims to comprehend the content of videos and automatically generate descriptive sentences. This requires a model with broad background knowledge to understand complex video events and express them as coherent sentences. Traditional video captioning is typically limited to modeling closed-domain videos, and its knowledge is fixed after training, which often results in short, uninformative captions. In contrast, we propose an open-domain video captioning method that incorporates external textual data to expand the model's knowledge domain and enhance its ability to generate nuanced, contextually relevant descriptions. Our method bridges the gap between video and text at the lexical level by employing a pretrained CLIP network to retrieve pertinent words from the training corpora of two datasets to serve as prompts. The retrieved words are used in two ways: implicit augmentation, which provides additional semantic features as implicit knowledge, and explicit augmentation, which updates the word sampling pool during reinforcement learning. Experiments on several benchmark datasets show that the proposed method is effective.
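As a rough illustration of the lexical-level retrieval step described in the abstract, the sketch below (not the authors' released code) uses the Hugging Face CLIP API to embed candidate words from a training corpus and sampled video frames in CLIP's joint space, then keeps the top-k most similar words as prompts. The checkpoint name, mean-pooling of frame features, and `top_k` value are illustrative assumptions, not details from the paper.

```python
# Hypothetical sketch of CLIP-based word retrieval for a video:
# embed corpus words with CLIP's text encoder, embed sampled frames
# with its image encoder, and rank words by cosine similarity.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def retrieve_words(frames: list[Image.Image], vocab: list[str],
                   top_k: int = 20) -> list[str]:
    # Encode sampled video frames and mean-pool into one video-level feature.
    img_in = processor(images=frames, return_tensors="pt")
    video_feat = model.get_image_features(**img_in).mean(dim=0, keepdim=True)
    video_feat = video_feat / video_feat.norm(dim=-1, keepdim=True)

    # Encode every candidate word drawn from the training corpus.
    txt_in = processor(text=vocab, return_tensors="pt", padding=True)
    word_feat = model.get_text_features(**txt_in)
    word_feat = word_feat / word_feat.norm(dim=-1, keepdim=True)

    # Rank words by cosine similarity to the video and return the top-k,
    # which would then serve as prompts for the captioning model.
    sims = (video_feat @ word_feat.T).squeeze(0)
    return [vocab[i] for i in sims.topk(top_k).indices.tolist()]
```

In the paper's terms, the returned words could feed both augmentation paths: their CLIP text features as additional semantic input (implicit), and the words themselves added to the sampling pool used during reinforcement-learning optimization (explicit).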