Abstract: Image captioning has drawn remarkable attention from the natural language processing and computer vision fields. Aiming to reduce the dependence on curated data, several studies have explored image captioning without any human-annotated image-text pairs, although existing methods are still outperformed by fully supervised approaches. This paper proposes TTLLCap, a text-only training method for image captioning based on prompting a pre-trained language model decoder with information obtained from CLIP representations of the inputs. Specifically, we experimented with the combined use of (a) retrieved examples of captions, (b) relevant concepts for the input, and (c) latent vector representations. Through extensive experiments, we show that TTLLCap outperforms previous training-free methods and is also competitive with other text-only training methods. We also analyze the impact of different choices regarding the configuration of the retrieval-augmentation component. The source code supporting our experiments is available from a public GitHub repository.
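To make the retrieval-augmented prompting idea concrete, the sketch below shows one way to retrieve captions from a text-only datastore with CLIP and assemble them into a prompt for a language model decoder. This is a minimal illustration under assumptions, not the paper's exact pipeline: the CLIP checkpoint, the prompt template, and the function names are illustrative, and the latent-vector component mentioned in the abstract is omitted.

```python
# Minimal sketch: CLIP-based cross-modal retrieval over a caption datastore,
# followed by prompt construction for a language model decoder.
# Model name, prompt template, and helper names are illustrative assumptions.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").to(device).eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")


@torch.no_grad()
def embed_captions(captions: list[str]) -> torch.Tensor:
    """Encode a text-only caption datastore with the CLIP text encoder."""
    inputs = processor(text=captions, return_tensors="pt",
                       padding=True, truncation=True).to(device)
    feats = clip.get_text_features(**inputs)
    return torch.nn.functional.normalize(feats, dim=-1)


@torch.no_grad()
def retrieve_captions(image: Image.Image, captions: list[str],
                      caption_feats: torch.Tensor, k: int = 4) -> list[str]:
    """Retrieve the k datastore captions whose CLIP text embeddings are
    closest to the CLIP image embedding of the input image."""
    inputs = processor(images=image, return_tensors="pt").to(device)
    img_feat = torch.nn.functional.normalize(
        clip.get_image_features(**inputs), dim=-1)
    sims = (caption_feats @ img_feat.T).squeeze(-1)
    top = sims.topk(k).indices.tolist()
    return [captions[i] for i in top]


def build_prompt(retrieved: list[str], concepts: list[str]) -> str:
    """Assemble a decoder prompt from retrieved captions and relevant
    concepts (the template here is a placeholder)."""
    examples = "\n".join(f"- {c}" for c in retrieved)
    return (f"Similar captions:\n{examples}\n"
            f"Relevant concepts: {', '.join(concepts)}\n"
            f"Caption for this image:")
```

Because only the caption datastore needs to be encoded in advance, this kind of setup requires no paired image-text supervision at training time; the image is only consulted at inference through the CLIP image encoder.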
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: cross-modal application, cross-modal content generation, multimodality
Contribution Types: NLP engineering experiment
Languages Studied: English
Submission Number: 1345