Abstract: CLIP has been a celebrated method for training vision encoders to generate image/text representations facilitating various applications. Recently, it has been widely adopted as the vision backbone of multimodal large language models (MLLMs). The success of CLIP relies on aligning web-crawled noisy text annotations at the image level. However, such a criterion may be insufficient for downstream tasks that need fine-grained vision representations, especially when MLLMs require region-level understanding. We improve the localization capability of CLIP with several advances. Our proposed pre-training method, Contrastive Localized Language-Image Pre-training (CLOC), complements CLIP with a region-text contrastive loss and modules. We formulate a new concept, promptable embeddings, in which the encoder produces image embeddings that are easy to transform into region representations given spatial hints. To support large-scale pre-training, we design a visually-enriched and spatially-localized captioning framework to effectively generate region-text labels. By scaling up to billions of annotated images, CLOC enables high-quality regional embeddings for recognition and retrieval tasks, and can serve as a drop-in replacement for CLIP to enhance MLLMs, especially on referring and grounding tasks.
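To make the two key ingredients of the abstract concrete, below is a minimal sketch of what a promptable region embedding and a region-text contrastive loss could look like. All names here (RegionPrompter, region_contrastive_loss, the box-to-attention pooling, and the toy shapes) are illustrative assumptions, not the paper's actual modules or training recipe.

```python
# Minimal sketch, assuming a CLIP-style encoder that outputs patch tokens and a
# text encoder that embeds region captions; module/function names are hypothetical.
import torch
import torch.nn as nn
import torch.nn.functional as F


class RegionPrompter(nn.Module):
    """Turns image patch embeddings into region embeddings given box prompts."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.box_embed = nn.Linear(4, dim)          # encode (x1, y1, x2, y2) in [0, 1]
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, patch_tokens: torch.Tensor, boxes: torch.Tensor) -> torch.Tensor:
        # patch_tokens: (B, N, D) image embeddings from the vision encoder
        # boxes:        (B, R, 4) normalized box prompts ("spatial hints")
        queries = self.box_embed(boxes)                          # (B, R, D)
        region_emb, _ = self.attn(queries, patch_tokens, patch_tokens)
        return F.normalize(region_emb, dim=-1)                   # (B, R, D)


def region_contrastive_loss(region_emb, region_text_emb, temperature: float = 0.07):
    """Symmetric InfoNCE between pooled region embeddings and region-caption embeddings."""
    r = region_emb.reshape(-1, region_emb.shape[-1])             # (B*R, D)
    t = F.normalize(region_text_emb.reshape(-1, region_text_emb.shape[-1]), dim=-1)
    logits = r @ t.t() / temperature
    targets = torch.arange(logits.shape[0], device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))


# Toy usage: 2 images, 256 patch tokens, 3 annotated regions each.
B, N, R, D = 2, 256, 3, 512
patch_tokens = torch.randn(B, N, D)              # from the image encoder
boxes = torch.rand(B, R, 4)                      # boxes from region-text annotations
region_text_emb = torch.randn(B, R, D)           # from the text encoder on region captions

prompter = RegionPrompter(D)
region_emb = prompter(patch_tokens, boxes)
loss = region_contrastive_loss(region_emb, region_text_emb)
# In a CLOC-style setup, this region-level loss would be added on top of the
# standard image-level CLIP contrastive objective.
```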
Lay Summary: Modern AI systems that understand images and text—like those that can describe a photo or answer questions about a picture—often rely on a popular model called CLIP. CLIP learns by matching images with their captions, using billions of such image-caption pairs collected from the internet. While this works well for many tasks, it struggles when more detailed understanding is needed, such as identifying specific objects or regions within an image.
To improve this, we developed a new method called CLOC (Contrastive Localized Language-Image Pre-training). Unlike CLIP, CLOC doesn’t just look at whole images and their captions. Instead, it also teaches the model to connect smaller image regions with detailed text descriptions. We introduced a concept called "promptable embeddings", which lets the model easily adapt its understanding to different image areas when given location hints.
We also created a pipeline to automatically generate fine-grained region descriptions at scale. With this, we trained CLOC on billions of images. Our results show that CLOC helps AI systems perform better on tasks requiring more precise image understanding, like pinpointing objects being referred to in a sentence.
Primary Area: Deep Learning->Foundation Models
Keywords: CLIP, MLLM, Foundation Models
Submission Number: 8216