ColCLIP: Enhancing Fine-Grained Image Retrieval with Pre-trained Embeddings

19 Sept 2023 (modified: 11 Feb 2024) · Submitted to ICLR 2024
Primary Area: representation learning for computer vision, audio, language, and other modalities
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Keywords: Multimodal Learning, Image, Language, Retrieval
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
TL;DR: We introduce ColCLIP, a fine-grained image retrieval system that refines pre-trained CLIP embeddings for better alignment with specific visual elements in queries.
Abstract: In image retrieval systems, efficiently searching for images based on any visual element described in the query is critical for user experience. However, current embedding models such as CLIP primarily align text with the most salient aspects of an image, which may not correspond to the elements users seek. In this paper, we propose ColCLIP, a fine-grained image retrieval system that leverages pre-trained embeddings and adapts them for fine-grained retrieval. We fine-tune CLIP on the Visual Genome dataset and incorporate the MaxSim operator for image-text interaction. Our evaluations show that ColCLIP consistently outperforms standard CLIP on fine-grained retrieval tasks. ColCLIP improves image retrieval by surfacing more relevant results for users while maintaining efficiency and ease of development. We release our code at https://anonymous.4open.science/r/image-is-context-32B6.
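For readers unfamiliar with the MaxSim operator (a late-interaction scoring scheme popularized by ColBERT), the sketch below shows one plausible way it could score a text query against an image: each text token embedding is matched to its most similar image patch embedding, and the per-token maxima are summed. The function name, tensor shapes, and normalization assumption are illustrative; this is not the authors' implementation.

```python
import torch

def maxsim_score(text_tokens: torch.Tensor, image_patches: torch.Tensor) -> torch.Tensor:
    """ColBERT-style MaxSim scoring of one query against one image.

    text_tokens:   (T, d) L2-normalized text token embeddings (assumed from CLIP's text encoder)
    image_patches: (P, d) L2-normalized image patch embeddings (assumed from CLIP's vision encoder)
    """
    # (T, P) matrix of cosine similarities between every token and every patch.
    sim = text_tokens @ image_patches.T
    # For each text token, keep its best-matching patch, then sum over tokens.
    return sim.max(dim=1).values.sum()
```

Compared with CLIP's single pooled image-text dot product, this per-token matching is what lets a query attend to non-salient visual elements, at the cost of storing one embedding per patch rather than one per image.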
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 1637