Abstract: Contrastive Language-Image Pre-training (CLIP) has recently driven significant advances in image captioning by providing effective multi-modal representation learning. However, prior studies primarily rely on CLIP's language-aligned visual semantics as input to the captioning model, leaving its robust vision-language relevance under-exploited. In this paper, we propose CONICA, a unified CONtrastive Image CAptioning framework that investigates how contrastive learning can further enhance image captioning from three aspects. First, we introduce contrastive learning objectives into the typical image captioning training pipeline with minimal overhead. Second, we construct fine-grained contrastive samples to obtain image-text similarities that correlate with the evaluation metrics of image captioning. Finally, we incorporate the learned contrastive knowledge into the decoding strategy to search for better captions. Experimental results demonstrate that CONICA significantly improves performance over standard captioning baselines and achieves new state-of-the-art results on MSCOCO and Flickr30K. Source code is available at https://github.com/DenglinGo/CONICA.
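To make the first aspect concrete, the sketch below shows one common way to attach a CLIP-style contrastive objective to a standard captioning training loop. This is a minimal PyTorch illustration under our own assumptions, not CONICA's actual formulation: the function name, the loss weighting, and the temperature value are hypothetical, and the paper's fine-grained sample construction is not reproduced here.

```python
import torch
import torch.nn.functional as F


def contrastive_captioning_loss(image_emb, text_emb, caption_logits,
                                caption_targets, temperature=0.07,
                                contrastive_weight=0.5):
    """Hypothetical combination of a symmetric InfoNCE loss (as in CLIP)
    with the usual token-level cross-entropy captioning loss.

    image_emb:       (B, D) L2-normalized image embeddings
    text_emb:        (B, D) L2-normalized caption embeddings
    caption_logits:  (B, T, V) decoder logits over the vocabulary
    caption_targets: (B, T) ground-truth token ids (-100 = ignored padding)
    """
    # Pairwise cosine similarities, scaled by an assumed temperature.
    sim = image_emb @ text_emb.t() / temperature            # (B, B)
    labels = torch.arange(image_emb.size(0), device=sim.device)

    # Symmetric InfoNCE: match each image to its caption and vice versa.
    loss_i2t = F.cross_entropy(sim, labels)
    loss_t2i = F.cross_entropy(sim.t(), labels)
    contrastive_loss = (loss_i2t + loss_t2i) / 2

    # Standard cross-entropy over generated caption tokens.
    captioning_loss = F.cross_entropy(
        caption_logits.reshape(-1, caption_logits.size(-1)),
        caption_targets.reshape(-1),
        ignore_index=-100,
    )

    # Assumed fixed weighting between the two objectives.
    return captioning_loss + contrastive_weight * contrastive_loss
```

Because the contrastive term reuses embeddings already computed by the captioning model for each in-batch image-caption pair, it adds little overhead beyond a B-by-B similarity matrix per batch, which is consistent with the abstract's "minimal overhead" claim.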