Keywords: multimodality, CLIP
TL;DR: Training ClipCap to caption images using only the text of image captions.
Abstract: CLIP (Radford et al., 2021) achieves strong performance in zero-shot image classification and other single-modality tasks through multi-modal pre-training. Recently, ClipCap (Mokady et al., 2021) demonstrated that the output of CLIP's vision encoder can be fed into GPT-2 to perform image captioning. In this work, we propose WS-ClipCap, which extends ClipCap to weakly-supervised image captioning by training only on the text of image captions. During training, WS-ClipCap encodes captions with CLIP's text encoder; at inference, it encodes images with CLIP's vision encoder. Because CLIP embeds both modalities in a joint space, the text and image representations are similar enough to be interchanged. WS-ClipCap substantially outperforms MAGIC (Su et al., 2022), which also trains only on caption text, and performs on par with ESPER (Yu et al., 2022), which trains only on images, while being significantly simpler than both. We also analyze how the distribution shift between CLIP's text and image embeddings affects the performance of WS-ClipCap and investigate several ways of correcting this mismatch.
Submission Type: archival
Presentation Type: onsite
Presenter: Derek Tam