Close the Gap: Lightweight Image Captioning via Retrieval Augmentation

22 Sept 2023 (modified: 11 Feb 2024), Submitted to ICLR 2024
Primary Area: generative models
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Keywords: Image Captioning, Vision-Language Models, Foundation Models, Large Language Models
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
Abstract: Image captioning is important for many applications such as content-based image search or accessibility for visually impaired individuals. To achieve rich language capabilities, recent work conditioned pretrained language models (LMs) on pretrained vision-language models (VLMs) that allow for image inputs. However, pretrained VLMs usually suffer from a modality gap, i.e., a misalignment of image and text representations in the joint embedding space. While this gap can in principle be reduced by finetuning, doing so is usually costly, requires large amounts of task-specific data, and is often infeasible. To address this issue, we propose to bridge the modality gap at lower cost via a linear mapping obtained from a least-squares solution. This requires no gradients and can be computed within minutes, even on a CPU. At inference, we apply the mapping to images embedded by the VLM and retrieve the closest captions from the training dataset. Along with an instruction, these captions serve as a prompt for the LM to generate a new caption. In addition, we propose a method to iteratively refine the mapping by bootstrapping synthetic captions from the LM, which enables explicit optimization for commonly used image captioning metrics. We find that reference-free metrics, such as CLIP-score, often assign unusually high scores to hallucinated content. On reference-based metrics, our method achieves performance competitive with lightweight captioning approaches on the MS-COCO and Flickr30k datasets.
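
The abstract describes a closed-form least-squares mapping from VLM image embeddings to the text embedding space, followed by nearest-neighbor caption retrieval. The following is a minimal sketch of that idea, not the authors' code: the variable names (`image_embs`, `caption_embs`, `captions`) and the choice of cosine-similarity retrieval are assumptions for illustration only.

```python
# Minimal sketch (not the authors' implementation) of the least-squares mapping
# and retrieval step described in the abstract. Assumes `image_embs` (N x d) and
# `caption_embs` (N x d) are paired image/text embeddings from a frozen VLM
# (e.g., CLIP), and `captions` holds the corresponding training captions.
import numpy as np

def fit_linear_map(image_embs: np.ndarray, caption_embs: np.ndarray) -> np.ndarray:
    """Closed-form least-squares W minimizing ||image_embs @ W - caption_embs||^2."""
    W, *_ = np.linalg.lstsq(image_embs, caption_embs, rcond=None)
    return W  # (d x d); no gradients needed, fast even on CPU

def retrieve_captions(query_image_emb: np.ndarray, W: np.ndarray,
                      caption_embs: np.ndarray, captions: list[str],
                      k: int = 5) -> list[str]:
    """Map a query image into text space and return the k nearest training captions."""
    mapped = query_image_emb @ W
    mapped = mapped / np.linalg.norm(mapped)
    normed = caption_embs / np.linalg.norm(caption_embs, axis=1, keepdims=True)
    sims = normed @ mapped                      # cosine similarity to every caption
    top_k = np.argsort(-sims)[:k]
    return [captions[i] for i in top_k]

# The retrieved captions, together with an instruction, would then form the
# prompt for the language model that generates the final caption.
```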
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
Supplementary Material: zip
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 5139