OC-CLIP : Object-centric binding in Contrastive Language-Image Pretraining

25 Sept 2024 (modified: 05 Feb 2025)Submitted to ICLR 2025EveryoneRevisionsBibTeXCC BY 4.0
Keywords: object-centric representations, object binding, CLIP, contrastive learning, compositional image-to-text retrieval
TL;DR: We analyze and propose an object-centric solution to the binding problem in CLIP
Abstract: Recent advancements in vision-language models (VLMs) have been driven by contrastive models like CLIP which learn to associate visual information with their corresponding text descriptions. However, these models have limitations in understanding complex compositional scenes involving multiple objects and their spatial relationships. To address these challenges, we propose a novel approach that diverges from traditional data-centric methods of enhancing model performance with hard negatives examples. Our work instead focuses on integrating sufficient inductive biases into pre-trained CLIP-like models to improve their compositional understanding without using additional data annotations. We introduce a binding module that connects a scene graph of the text with an induced graph-like representation of the image, facilitating a structured similarity assessment. We also leverage relationships as text-conditioned visual constraints, thereby capturing the intricate interactions between objects and their contextual relationships more effectively. Our resulting model (OC-CLIP) not only enhances the performance of CLIP in multi-object compositional understanding but also paves the way for more accurate and efficient image-text matching in complex scenes.
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 4586
Loading