Keywords: CLIP, Vision-language models, Language models, LLMs, Image Captioning
TL;DR: We advocate for linear alignment of CLIP-style models via orthogonal procrustes and illustrate its benefits on caption generation and evaluation.
Abstract: Recently, vision-language models like CLIP have advanced the state of the art in a variety of multi-modal tasks including image captioning and caption evaluation. Many approaches adapt CLIP-style models to a downstream task by training a mapping network between CLIP and a language model. This is costly as it usually involves calculating gradients for large models. We propose a more efficient training protocol that fits a linear mapping between image and text embeddings of CLIP via a closed-form solution. This bypasses the need for gradient computation and results in a lightweight captioning method called ReCap, which can be trained up to 1000 times faster than existing lightweight methods. Moreover, we propose two new learning-based image-captioning metrics that build on CLIP score along with our linear mapping. We evaluate ReCap on MS-COCO, Flickr30k, VizWiz, and MSRVTT. ReCap achieves performance comparable to state-of-the-art lightweight methods on established metrics while outperforming them on our new metrics, which are better aligned with human judgement than established ones.
Submission Number: 17
Loading