Injecting Image Details into CLIP's Feature Space

Published: 01 Feb 2023, Last Modified: 12 Mar 2024 · ICLR 2023 Conference Withdrawn Submission · Readers: Everyone
Keywords: Text-Based Information Retrieval, Fine-Detail, CLIP, Single Feature, Detail Compression, Complete Cover, Feature Space Alignment, Self-Supervised
TL;DR: We propose a framework consisting of a model-agnostic Complete Cover scheme for obtaining image patches, a feature fusing model, a corresponding query proxy loss, and a new text-image retrieval benchmark.
Abstract: Although CLIP-like vision-language models provide a functional joint feature space for images and text, the limited image input size of CLIP-like models (e.g., 224) means that subtle details are lost in the feature representation when high-resolution images (e.g., 2240) are given as input. In this work, we introduce an efficient framework that produces a single feature representation for a high-resolution image, injecting image details while sharing the same semantic space as the original CLIP. Within the framework, we train a feature fusing model on CLIP features extracted with a carefully designed image patch scheme (Complete Cover) that can cover objects of any scale, weakly supervised by image-agnostic class-prompted queries. We validate the framework by retrieving images from class-prompted queries on existing real-world and synthetic datasets, showing significant performance improvements on these tasks. Furthermore, to fully demonstrate the framework's detail retrieval ability, we construct a CLEVR-like synthetic dataset, CLEVR-DS, which is fully annotated and has a controllable object scale.
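Below is a minimal sketch of the pipeline the abstract describes, assuming OpenAI's `clip` package. The multi-scale grid tiling is only a simplistic stand-in for the paper's Complete Cover scheme, and the similarity-weighted pooling is only a placeholder for the learned fusing model trained with the query proxy loss; the file name, prompts, and scale choices are hypothetical.

```python
# Sketch only: not the authors' implementation.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def complete_cover_patches(img: Image.Image, scales=(1, 2, 4)):
    """Tile the image at several scales so objects of any size fall inside at
    least one patch (a simplistic stand-in for the Complete Cover scheme)."""
    w, h = img.size
    for s in scales:
        pw, ph = w // s, h // s
        for i in range(s):
            for j in range(s):
                yield img.crop((i * pw, j * ph, (i + 1) * pw, (j + 1) * ph))

@torch.no_grad()
def fused_image_feature(img: Image.Image) -> torch.Tensor:
    """Encode every patch with CLIP and fuse the patch features into a single
    vector in CLIP's joint space. The paper trains a fusing model for this;
    attention-like pooling against the full-image feature is used here purely
    for illustration."""
    batch = torch.stack([preprocess(p) for p in complete_cover_patches(img)]).to(device)
    feats = model.encode_image(batch).float()
    feats = feats / feats.norm(dim=-1, keepdim=True)
    query = feats[0]                                  # the scale-1 "patch" is the whole image
    weights = torch.softmax(feats @ query / 0.07, dim=0)
    fused = (weights[:, None] * feats).sum(dim=0)
    return fused / fused.norm()

# Usage: rank class-prompted text queries against one high-resolution image.
image = Image.open("high_res_scene.jpg")              # hypothetical file
texts = clip.tokenize(["a photo of a small red cube",
                       "a photo of a large blue sphere"]).to(device)
with torch.no_grad():
    text_feats = model.encode_text(texts).float()
    text_feats = text_feats / text_feats.norm(dim=-1, keepdim=True)
    scores = text_feats @ fused_image_feature(image)  # higher = better match
```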
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics
Submission Guidelines: Yes
Please Choose The Closest Area That Your Submission Falls Into: Deep Learning and representational learning
Community Implementations: [2 code implementations](https://www.catalyzex.com/paper/arxiv:2208.14649/code)