Abstract: Image captioning refers to generating a natural language caption for a given image. This task is usually thought to require a large amount of supervised training to align the image and text modalities, which is expensive in both time and resources. Recently, unsupervised methods for image captioning based on multi-modal models have attracted increasing attention. However, the generated captions are often limited by the structure of the multi-modal model and fail to accurately capture essential information within the images. To address these limitations, we propose FocusCap, a method that combines the CLIP multi-modal model with a pre-trained language model for unsupervised image captioning. FocusCap uses CLIP to compute the similarity between the image and candidate text, as well as between objects and text, and the resulting similarity scores serve as visual information that controls the language model's generation process. By leveraging the multi-modal features extracted by CLIP to guide text generation, FocusCap avoids the costly large-scale training otherwise needed for image-text feature extraction and modality alignment. Experimental results show that FocusCap outperforms existing zero-shot methods on the Microsoft COCO and Flickr30K datasets.
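As a rough illustration of the signal the abstract describes, the sketch below computes CLIP similarity between an image and a set of candidate captions, the kind of score that could be used to re-rank or steer a language model's continuations. This is not the authors' implementation; the backbone name, file path, and candidate list are illustrative assumptions.

```python
# Minimal sketch: CLIP image-text similarity as a guidance score for caption candidates.
# Assumptions: OpenAI CLIP (pip install clip) with the ViT-B/32 backbone; "example.jpg"
# and the candidate captions are hypothetical placeholders.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)
candidates = ["a dog running on the beach", "a plate of food on a table"]
text = clip.tokenize(candidates).to(device)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    # Cosine similarity between the image and each candidate caption
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)
    similarity = (image_features @ text_features.T).squeeze(0)

# A higher score indicates a caption better grounded in the image; such scores can
# act as the visual control signal during language model decoding.
print(candidates[similarity.argmax().item()], similarity.tolist())
```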