Part-Aware CLIP: Enhancing Fine-Grained Understanding with Part-level Descriptions

12 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: Vision-Language Models, Contrastive Learning
Abstract: Vision-Language Pre-trained (VLP) models, such as CLIP, learn powerful representations from large-scale image-text pairs, achieving remarkable zero-shot capabilities. However, by aligning visual and textual features only at a global level, these models exhibit two critical limitations: poor interpretability and weak fine-grained perception. This deficiency arises because the pre-training data overlook key visual details of object parts, which are often essential for distinguishing between subordinate categories (e.g., species of birds or models of cars). This fundamental weakness is inherited by Multimodal Large Language Models (MLLMs) that use CLIP-based vision encoders, limiting both their accuracy and trustworthiness. To address these challenges, we introduce Part-Aware CLIP (PA-CLIP), a framework designed to enhance both the fine-grained perception and interpretability of VLP models. First, we employ MLLMs to create FG-Part, a new dataset of approximately 1 million part-level image-text pairs that explicitly captures the critical visual details of object components (e.g., a bird's beak and wing patterns). Second, we design a part-aware training strategy that leverages this curated data to compel the model to ground fine-grained textual descriptions in specific image regions. By enforcing this explicit, part-level alignment, our method enhances the model's ability to perceive key details, thereby improving both its fine-grained recognition capabilities and its inherent interpretability. Extensive experiments show that PA-CLIP achieves state-of-the-art performance on multiple fine-grained visual recognition (FGVR) benchmarks, highlighting the effectiveness of part-level captions and the capability of PA-CLIP to capture subtle visual details. Furthermore, evaluations on general tasks, including cross-modal retrieval, confirm that these gains do not come at the cost of the model's core generalist capabilities.
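The part-aware alignment described in the abstract can be illustrated with a CLIP-style symmetric contrastive objective over (region, part-caption) pairs. The sketch below is a minimal illustration, not the authors' released implementation; the tensor names, the assumption that region features are already pooled, and the temperature value are all illustrative choices.

```python
# Minimal sketch of a part-level contrastive loss (assumed form, not the
# paper's official code): each object-part region feature is pulled toward
# the text feature of its matching part description, with other pairs in
# the batch serving as negatives.
import torch
import torch.nn.functional as F

def part_level_contrastive_loss(part_region_feats: torch.Tensor,
                                part_text_feats: torch.Tensor,
                                temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE over (region, part-caption) pairs.

    part_region_feats: (N, D) pooled visual features of object-part regions.
    part_text_feats:   (N, D) text features of the matching part descriptions.
    Row i of each tensor is assumed to be a positive pair.
    """
    v = F.normalize(part_region_feats, dim=-1)
    t = F.normalize(part_text_feats, dim=-1)
    logits = v @ t.T / temperature  # (N, N) region-to-caption similarities
    targets = torch.arange(v.size(0), device=v.device)
    # Average the two directions: regions -> captions and captions -> regions.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.T, targets))
```

In practice such a loss would be combined with the standard global image-text objective, so that part-level grounding sharpens fine-grained perception without sacrificing the generalist alignment the abstract reports.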
Primary Area: foundation or frontier models, including LLMs
Submission Number: 4500