Keywords: CLIP, Vision Language Model, Contrastive Learning
Abstract: CLIP is one of the most important foundational multimodal models today. It aligns image and text modalities into a shared feature space by leveraging a simple contrastive learning loss on massive image-text pairs.
As a retriever, CLIP supports tasks such as zero-shot classification, detection, segmentation, and image-text retrieval. As a cross-modal feature extractor, it also enables tasks such as image understanding, video understanding, and text-to-image generation. However, as expectations for model generalization and task complexity grow, CLIP's original learning paradigm shows limitations in its feature extraction capabilities. In particular, CLIP's text encoder is often criticized for behaving like a bag-of-words model that cannot extract fine-grained or complex features. We believe these limitations stem from two core issues: the simplicity of the training captions, and the fact that CLIP's self-supervised task can be solved without logical reasoning. In addition, the small-scale text encoder used in CLIP cannot fully understand high-quality caption data.
In this work, we propose a post-finetuning approach for CLIP that introduces large language models (LLMs) into the training process to leverage more sophisticated textual data. Our experiments demonstrate that, even with minimal additional training, LLMs can be aligned with the pretrained CLIP visual encoder, providing higher-dimensional and more effective supervision that overcomes CLIP's original limitations.
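
For readers unfamiliar with the contrastive objective the abstract refers to, the following is a minimal sketch of a symmetric image-text contrastive loss and of zero-shot classification as retrieval. It is an illustration only, not the method proposed in this submission: the function names, the toy 2-D features, and the fixed temperature of 0.07 are assumptions made for the example (CLIP itself learns the temperature), and feature vectors are assumed to be non-zero.

```python
import math

def l2_normalize(v):
    # Scale a non-zero vector to unit length so dot products are cosine similarities.
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cross_entropy_diagonal(logits):
    # Mean cross-entropy where the correct class for row i is column i,
    # i.e. the i-th image is paired with the i-th caption.
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[i]
    return total / len(logits)

def clip_contrastive_loss(image_feats, text_feats, temperature=0.07):
    # Hypothetical helper: symmetric contrastive loss over a batch of pairs.
    imgs = [l2_normalize(v) for v in image_feats]
    txts = [l2_normalize(v) for v in text_feats]
    # logits[i][j] = cosine similarity of image i and text j, scaled by 1/temperature
    logits = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txts]
              for i in imgs]
    logits_t = [list(col) for col in zip(*logits)]  # text-to-image direction
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(logits_t))

def zero_shot_classify(image_feat, class_text_feats):
    # Hypothetical helper: pick the class whose text embedding is most
    # similar to the image embedding (classification as retrieval).
    img = l2_normalize(image_feat)
    sims = [sum(a * b for a, b in zip(img, l2_normalize(t)))
            for t in class_text_feats]
    return max(range(len(sims)), key=sims.__getitem__)

if __name__ == "__main__":
    images = [[1.0, 0.0], [0.0, 1.0]]
    matched = [[1.0, 0.0], [0.0, 1.0]]
    shuffled = [[0.0, 1.0], [1.0, 0.0]]
    print(clip_contrastive_loss(images, matched))   # small: pairs are aligned
    print(clip_contrastive_loss(images, shuffled))  # large: pairs are mismatched
    print(zero_shot_classify([0.9, 0.1], matched))
```

The loss only rewards ranking the matching caption above the other captions in the batch, which is consistent with the abstract's point that the task can be solved without reasoning over caption structure.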
Submission Number: 32