FG-CLIP: Fine-Grained Visual and Textual Alignment

Published: 01 May 2025 · Last Modified: 18 Jun 2025 · ICML 2025 poster · License: CC BY-NC-SA 4.0
Abstract: Contrastive Language-Image Pre-training (CLIP) excels in multimodal tasks such as image-text retrieval and zero-shot classification but struggles with fine-grained understanding due to its focus on coarse-grained short captions. To address this, we propose Fine-Grained CLIP (FG-CLIP), which enhances fine-grained understanding through three key innovations. First, we leverage large multimodal models to generate 1.6 billion long caption-image pairs for capturing global-level semantic details. Second, a high-quality dataset is constructed with 12 million images and 40 million region-specific bounding boxes aligned with detailed captions to ensure precise, context-rich representations. Third, 10 million hard fine-grained negative samples are incorporated to improve the model's ability to distinguish subtle semantic differences. We integrate the high-quality region-specific annotations and the challenging fine-grained negative samples into a comprehensive dataset, termed FineHARD, and design training methods tailored to these data. Extensive experiments demonstrate that FG-CLIP outperforms the original CLIP and other state-of-the-art methods across various downstream tasks, including fine-grained understanding, open-vocabulary object detection, image-text retrieval, and general multimodal benchmarks. These results highlight FG-CLIP's effectiveness in capturing fine-grained image details and improving overall model performance. The data, code, and models are available at https://github.com/360CVGroup/FG-CLIP.
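The hard-negative component described above can be pictured as a contrastive objective in which each image is scored against its matching caption and a set of subtly altered captions, and the model must rank the true caption highest. The PyTorch sketch below is purely illustrative: the function name, tensor shapes, number of hard negatives, and temperature are assumptions for exposition, not the authors' released implementation.

```python
# Minimal sketch of a contrastive loss with hard fine-grained negative captions.
# All names and shapes here are illustrative assumptions, not FG-CLIP's actual code.
import torch
import torch.nn.functional as F


def hard_negative_contrastive_loss(image_emb, pos_text_emb, neg_text_emb, temperature=0.07):
    """
    image_emb:    (B, D)    L2-normalized image embeddings
    pos_text_emb: (B, D)    L2-normalized embeddings of the matching captions
    neg_text_emb: (B, K, D) L2-normalized embeddings of K hard negative captions
                            per image (e.g. captions with subtle attribute changes)
    """
    # Similarity to the positive caption: (B, 1)
    pos_sim = (image_emb * pos_text_emb).sum(dim=-1, keepdim=True)
    # Similarity to each hard negative caption: (B, K)
    neg_sim = torch.einsum("bd,bkd->bk", image_emb, neg_text_emb)
    # Index 0 holds the positive; cross-entropy pushes it above all hard negatives.
    logits = torch.cat([pos_sim, neg_sim], dim=1) / temperature
    targets = torch.zeros(logits.size(0), dtype=torch.long, device=logits.device)
    return F.cross_entropy(logits, targets)


if __name__ == "__main__":
    # Toy usage with random embeddings: B=4 images, K=10 hard negatives, D=512 dims.
    B, K, D = 4, 10, 512
    img = F.normalize(torch.randn(B, D), dim=-1)
    pos = F.normalize(torch.randn(B, D), dim=-1)
    neg = F.normalize(torch.randn(B, K, D), dim=-1)
    print(hard_negative_contrastive_loss(img, pos, neg).item())
```

Because the negatives differ from the positive caption only in fine-grained details (attributes, counts, relations), minimizing this loss pressures the encoders to represent exactly those details rather than coarse scene-level content.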
Lay Summary: Modern AI systems often struggle to understand the fine details in images when paired with descriptive text. While models like CLIP have made great progress in matching images and text at a high level, their reliance on coarse-grained short captions limits fine-grained understanding. To address this, we propose Fine-Grained CLIP (FG-CLIP), a new method designed to enhance the model's ability to capture and understand detailed visual information. Our approach includes three key innovations. First, we leverage large multimodal models to generate 1.6 billion caption-image pairs with long, detailed descriptions, which helps the model learn global-level semantic details more effectively. Second, we construct a dataset containing 12 million images and 40 million region-specific bounding boxes aligned with detailed captions, ensuring precise and context-rich representations of objects within images. Third, we incorporate 10 million challenging fine-grained negative samples to improve the model's ability to distinguish subtle semantic differences, making it better at recognizing small but important details. We integrate the high-quality region-specific annotations and challenging fine-grained negative samples into a comprehensive dataset, termed FineHARD, and design corresponding training methods to make full use of these data. Extensive experiments show that FG-CLIP outperforms existing methods on various tasks requiring detailed understanding, including image-text retrieval, open-vocabulary object detection, and fine-grained recognition. These results demonstrate that FG-CLIP not only captures nuanced visual content more accurately but also enhances the overall capability of vision-language models. We release our dataset, code, and models at https://github.com/360CVGroup/FG-CLIP.
Link To Code: https://github.com/360CVGroup/FG-CLIP
Primary Area: Deep Learning->Foundation Models
Keywords: Vision-Language Model, Contrastive Learning, Image-Text Dataset
Submission Number: 1499