Exploring Annotation-free Image Captioning with Retrieval-augmented Pseudo Sentence Generation

ACL ARR 2024 June Submission 1954 Authors

15 Jun 2024 (modified: 19 Jul 2024) · ACL ARR 2024 June Submission · CC BY 4.0
Abstract: Recently, training an image captioner without annotated image-sentence pairs has gained traction. Previous methods face limitations: they either use mismatched corpora, which yield inaccurate pseudo pairs, or rely on resource-intensive pre-training. To alleviate these challenges, we propose a new strategy in which the prior knowledge of large pre-trained models (LPMs) is distilled and leveraged as supervision, and a retrieval process is integrated to further reinforce the effectiveness of that supervision. Specifically, we introduce $\textbf{R}$etrieval-$\textbf{a}$ugmented $\textbf{P}$seudo $\textbf{S}$entence $\textbf{G}$eneration (RaPSG), which efficiently retrieves highly relevant short region descriptions from mismatched corpora and uses them to generate a variety of high-quality pseudo sentences via LPMs. Additionally, we introduce a fluency filter to eliminate low-quality pseudo sentences and a CLIP guidance objective to enhance contrastive information learning. Experimental results show that our method outperforms SOTA captioning models in zero-shot, unsupervised, semi-supervised, and cross-domain scenarios. Moreover, we observe that generating high-quality pseudo sentences may offer better supervision than the sentence-crawling strategy, highlighting future research opportunities.
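
The abstract describes the RaPSG pipeline only at a high level. As a rough illustration of the retrieval and filtering ideas, the sketch below scores region descriptions and candidate pseudo sentences against an image with CLIP similarity. It is not the authors' implementation: the model checkpoint, corpus, threshold, and both helper functions are hypothetical, and the paper's fluency filter (which is not specified in the abstract) is stood in for here by a simple CLIP-similarity cutoff.

```python
# Illustrative sketch only; the corpus, threshold, and helpers are assumptions.
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")


@torch.no_grad()
def _clip_similarities(image, texts):
    """Cosine similarity between one image and a list of texts."""
    inputs = processor(text=texts, images=image, return_tensors="pt",
                       padding=True, truncation=True)
    img = model.get_image_features(pixel_values=inputs["pixel_values"])
    txt = model.get_text_features(input_ids=inputs["input_ids"],
                                  attention_mask=inputs["attention_mask"])
    img = img / img.norm(dim=-1, keepdim=True)
    txt = txt / txt.norm(dim=-1, keepdim=True)
    return (txt @ img.T).squeeze(-1)  # one score per text


def retrieve_descriptions(image, corpus, k=5):
    """Retrieve the k region descriptions most relevant to the image,
    mimicking the retrieval step that feeds the LPM-based generator."""
    sims = _clip_similarities(image, corpus)
    top = sims.topk(min(k, len(corpus))).indices.tolist()
    return [corpus[i] for i in top]


def filter_pseudo_sentences(image, candidates, threshold=0.25):
    """Keep LPM-generated pseudo sentences whose image-text similarity
    clears a cutoff (a stand-in for the paper's fluency filter)."""
    sims = _clip_similarities(image, candidates)
    return [c for c, s in zip(candidates, sims.tolist()) if s >= threshold]
```

In this reading, retrieved descriptions condition an LPM to generate candidate captions, and only candidates that survive the filter are kept as pseudo pairs for training the captioner.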
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: cross-modal content generation; image text matching; cross-modal information extraction
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Data analysis
Languages Studied: English
Submission Number: 1954