Classifier-free guidance makes image captioning models more descriptive

Published: 06 Mar 2023, Last Modified: 01 May 2023, MRL 2023
Keywords: classifier-free guidance, image captioning, CLIPScore, CLIP, CIDEr, prompt inversion
TL;DR: Image captioning with classifier-free guidance results in captions that are more descriptive and lie closer to the corresponding images in the CLIP embedding space, but yields lower scores on traditional captioning metrics
Abstract: Image captioning is conventionally formulated as the task of generating captions that are similar to a set of human-generated reference captions, as measured using evaluation metrics such as CIDEr, ROUGE, and BLEU. Recent work has also explored reference-free captioning metrics based on the distance between generated captions and the corresponding images in the embedding space of a contrastively-trained image-text model such as CLIP. Here, we show that it is possible to trade off between reference-free and reference-based captioning metrics by decoding from a single autoregressive captioning model using classifier-free guidance (Ho & Salimans, 2021). Compared to standard greedy decoding, decoding from the same model with a guidance scale of 3 substantially improves caption→image retrieval performance when captions and images are embedded using CLIP (recall@1 49.4% vs. 26.5%) and CLIPScore (0.808 vs. 0.775), but greatly worsens standard reference-based captioning metrics (e.g., CIDEr 41.7 vs. 126.1). Manual inspection reveals that higher guidance scales produce more descriptive but less grammatical captions.
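The core decoding rule from the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: at each autoregressive step, the model's image-conditioned logits are extrapolated away from its unconditional (image-free) logits by the guidance scale, exactly as in classifier-free guidance for diffusion models. The function name `cfg_logits` and the toy vocabulary are assumptions for illustration.

```python
import numpy as np

def cfg_logits(cond_logits, uncond_logits, guidance_scale):
    # Classifier-free guidance (Ho & Salimans, 2021) applied to one
    # autoregressive decoding step: extrapolate from the unconditional
    # (no-image) logits toward the image-conditioned logits.
    # guidance_scale = 1 recovers standard conditional decoding;
    # larger scales push the caption toward image-specific tokens.
    return uncond_logits + guidance_scale * (cond_logits - uncond_logits)

# Toy example: a 5-token vocabulary at a single decoding step.
cond = np.array([2.0, 1.0, 0.5, 0.1, -1.0])    # logits given the image
uncond = np.array([1.5, 1.2, 0.4, 0.3, -0.5])  # logits with no image
guided = cfg_logits(cond, uncond, guidance_scale=3.0)
token = int(np.argmax(guided))  # greedy pick on the guided logits
```

In a real captioning model the two logit vectors would come from two forward passes of the same network, one with the image and one with a null/empty image input; the paper's finding is that raising the guidance scale above 1 trades reference-based metrics (CIDEr) for reference-free ones (CLIPScore, retrieval).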