PatchCraft: Learning Optimized Image Patch for Enhanced Visual Attention of CLIP

19 Sept 2023 (modified: 25 Mar 2024) · ICLR 2024 Conference Withdrawn Submission
Primary Area: visualization or interpretation of learned representations
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Keywords: vision-language models, explainable AI, transformer-based models
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
TL;DR: We propose a method to learn an optimized patch that can enhance the visual attention of vision-language models such as CLIP.
Abstract: Large-scale vision-language models such as CLIP have become proficient at building robust connections between images and text, supporting a diverse range of practical applications, from zero-shot classification to generating images from textual descriptions. However, compared with their counterparts among large language models such as GPT-3, these vision-language models remain limited in handling novel discriminative tasks through prompting. In this work, we explore visual prompts as a strategy for addressing computer vision tasks that go beyond mere classification. Rather than relying exclusively on text-based prompts, we investigate the potential of directly manipulating images. Specifically, we discover an intriguing capability of CLIP: the model's attention can be directed to a specific region within an image by introducing an optimized patch onto that region. This approach allows the model to concentrate on local details while preserving overall contextual understanding. Our experiments demonstrate the effectiveness of this straightforward technique, achieving strong performance on keypoint localization and keypoint naming tasks.
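The patch-optimization idea described in the abstract can be sketched as gradient ascent on a learnable patch pasted onto a chosen image region, maximizing the similarity between the composed image's embedding and a target embedding. The snippet below is a minimal illustration only: the tiny random conv net stands in for CLIP's image encoder, the random target vector stands in for a text embedding, and the paste location, loss, and hyperparameters are all assumptions, not the paper's actual method.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Stand-in for CLIP's frozen image encoder (purely illustrative; the real
# method would use CLIP's ViT/ResNet encoder).
encoder = torch.nn.Sequential(
    torch.nn.Conv2d(3, 8, kernel_size=7, stride=4),
    torch.nn.ReLU(),
    torch.nn.AdaptiveAvgPool2d(1),
    torch.nn.Flatten(),
    torch.nn.Linear(8, 16),
)
for p in encoder.parameters():
    p.requires_grad_(False)

image = torch.rand(1, 3, 64, 64)                    # base image
target = F.normalize(torch.randn(1, 16), dim=-1)    # stand-in text embedding

# Learnable patch, pasted onto a fixed region of the image.
patch = torch.zeros(1, 3, 16, 16, requires_grad=True)
opt = torch.optim.Adam([patch], lr=0.05)

def similarity():
    composed = image.clone()
    # Sigmoid keeps patch pixels in [0, 1]; paste onto the target region.
    composed[:, :, 24:40, 24:40] = torch.sigmoid(patch)
    feat = F.normalize(encoder(composed), dim=-1)
    return (feat * target).sum()

before = similarity().item()
for _ in range(100):
    opt.zero_grad()
    loss = -similarity()     # gradient ascent on cosine similarity
    loss.backward()
    opt.step()
after = similarity().item()  # similarity should have increased
```

In the actual setting, the similarity target would come from CLIP's text encoder, and the optimized patch would then be applied at inference time to direct the model's attention to the region of interest.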
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 2027