CLIP-Driven Low-Cost Image Captioning

Published: 01 Jan 2024, Last Modified: 13 Nov 2024 · IJCNN 2024 · CC BY-SA 4.0
Abstract: Image captioning, as a typical multi-modal task, has received increasing attention and has made significant progress. Recently, CLIP has shown strong application potential in various tasks, including image captioning. However, current approaches simply employ CLIP as a visual feature extractor without fully exploiting the benefits of its contrastive learning paradigm. Furthermore, most of them are resource-intensive, requiring either additional large-scale image-text datasets for incremental pre-training or additional pre-trained language models for captioning decoder initialization and fine-tuning. In this paper, we propose a simple and effective approach that fully exploits CLIP's properties and advantages to transfer it into an image captioning model. Specifically, on top of the extracted visual features, we design a memory retrieval module to introduce additional information and a joint representation module for feature enhancement. Furthermore, our captioning decoder is built on the CLIP text encoder, reusing its parameters to leverage its prior textual embedding knowledge. Experimental results on the MSCOCO dataset verify the effectiveness of our proposal.
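
For intuition, the sketch below shows one plausible way such a pipeline could be wired up in PyTorch on top of the open-source OpenAI `clip` package. The memory bank, the joint projection, and the prefix-style decoding are hypothetical stand-ins for the paper's memory retrieval module, joint representation module, and parameter-reusing decoder; the exact designs are not specified here.

```python
# Minimal sketch, assuming the open-source OpenAI `clip` package (ViT-B/32, 512-d embeddings).
# All module names and design choices below are illustrative assumptions, not the paper's code.
import torch
import torch.nn as nn
import torch.nn.functional as F
import clip


class CLIPCaptioner(nn.Module):
    def __init__(self, memory_bank: torch.Tensor, vocab_size: int = 49408, device: str = "cpu"):
        super().__init__()
        self.clip_model, _ = clip.load("ViT-B/32", device=device)   # shared visual/text backbone
        self.register_buffer("memory", memory_bank)                  # (M, 512) text-embedding memory (assumed)
        self.joint_proj = nn.Linear(2 * 512, 512)                    # joint representation module (assumed design)
        # Captioning decoder built on the CLIP text encoder, reusing its parameters.
        self.decoder = self.clip_model.transformer
        self.head = nn.Linear(512, vocab_size)                       # caption token classifier

    def forward(self, images: torch.Tensor, token_embeddings: torch.Tensor) -> torch.Tensor:
        v = self.clip_model.encode_image(images).float()             # (B, 512) CLIP visual features
        # Memory retrieval: soft lookup of the memory bank by cosine similarity.
        sims = F.normalize(v, dim=-1) @ F.normalize(self.memory, dim=-1).t()
        retrieved = sims.softmax(dim=-1) @ self.memory               # (B, 512)
        # Joint representation: fuse visual and retrieved features.
        joint = self.joint_proj(torch.cat([v, retrieved], dim=-1))   # (B, 512)
        # Prefix the fused feature to the caption token embeddings and decode.
        x = torch.cat([joint.unsqueeze(1), token_embeddings], dim=1)
        # CLIP's text transformer expects (L, B, D) and a fixed context length of 77.
        x = self.decoder(x.permute(1, 0, 2)).permute(1, 0, 2)
        return self.head(x)                                          # (B, L, vocab_size)
```

In this sketch, `token_embeddings` would come from CLIP's own token embedding layer, and the prefix plus caption tokens must total 77 positions to match the fixed causal attention mask of CLIP's text transformer; images are assumed to be preprocessed with CLIP's standard transform.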