RelationCLIP: Training-free Fine-grained Visual and Language Concept Matching

22 Sept 2022 (modified: 13 Feb 2023) · ICLR 2023 Conference Withdrawn Submission · Readers: Everyone
Keywords: Zero-shot, Image-text Matching, CLIP
TL;DR: A simple yet effective training-free method that improves the zero-shot performance of CLIP-like models on fine-grained image-text matching datasets.
Abstract: Contrastive Language-Image Pretraining (CLIP) has demonstrated strong zero-shot performance for image-text matching because its natural language supervision covers large-scale, unconstrained real-world visual concepts. However, it remains challenging to adapt CLIP, without training, to fine-grained image-text matching between disentangled visual concepts and text semantics. Towards more accurate zero-shot inference with CLIP-like models for fine-grained concept matching, we study the image-text matching problem from a causal perspective: the erroneous semantics of individual entities are essentially confounders that cause matching failures. We therefore propose a novel training-free framework, RelationCLIP, which disentangles input images into subject, object, and action entities. By exploiting fine-grained matching between visual components and word concepts from different entities, RelationCLIP mitigates spurious correlations introduced by pretrained CLIP models and dynamically assesses the contribution of each entity when matching images and text. Experiments on SVO-Probes and our newly introduced Visual Genome Concept datasets demonstrate the effectiveness of our plug-and-play method, which boosts the zero-shot inference ability of CLIP without any pre-training or fine-tuning. Our code is available at https://anonymous.4open.science/r/Relation-CLIP.
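
To make the entity-level matching idea concrete, the following is a minimal sketch, not the authors' released implementation, of how a frozen CLIP could score an image against a caption together with its subject/action/object phrases, using the Hugging Face transformers CLIP API. The entity phrases are assumed to come from an external parser, and the softmax weighting and the equal blend of holistic and entity-level scores are illustrative heuristics rather than the paper's exact aggregation rule.

# Illustrative sketch (not the authors' code): score an image against a caption
# by also matching entity-level phrases (subject / action / object) with a frozen CLIP.
# The caption is assumed to be pre-split into entity phrases by an external parser;
# the aggregation below is a simple heuristic, not the paper's exact rule.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
model.eval()

def entity_level_score(image: Image.Image, caption: str, entities: list[str]) -> float:
    """Combine the full-caption CLIP score with per-entity scores.

    `entities` is a hypothetical input such as ["a dog", "running", "a beach"],
    i.e. subject / action / object phrases extracted from the caption.
    """
    texts = [caption] + entities
    inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    # logits_per_image: (1, num_texts) image-text similarities scaled by CLIP's temperature
    sims = out.logits_per_image[0]
    caption_sim, entity_sims = sims[0], sims[1:]
    # Weight each entity by how confidently CLIP matches it (dynamic contribution)
    weights = torch.softmax(entity_sims, dim=0)
    entity_term = (weights * entity_sims).sum()
    # Equal-weight blend of holistic and entity-level evidence (heuristic choice)
    return 0.5 * caption_sim.item() + 0.5 * entity_term.item()

For zero-shot matching, one would compute this score for each candidate caption (or each candidate image) and select the highest-scoring pair.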
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics
Submission Guidelines: Yes
Please Choose The Closest Area That Your Submission Falls Into: Applications (eg, speech processing, computer vision, NLP)
