FocalLens: Instruction Tuning Enables Zero-Shot Conditional Image Representations

27 Sept 2024 (modified: 05 Feb 2025) · Submitted to ICLR 2025 · CC BY 4.0
Keywords: Conditional Image Representation, Instruction Tuning, Contrastive Learning, Vision-Language Models
TL;DR: We leverage contrastive instruction tuning to train text-conditioned vision encoders that produce representations aligned with specific conditions of interest in a zero-shot manner.
Abstract: Visual feature extraction is fundamental to many vision tasks. Most existing methods extract visual features by encoding an image into a generic feature vector. However, an image naturally contains rich information, and there may be multiple perspectives from which to describe it. For each application, we might be interested in different aspects of an image and want to prioritize those features over others. For instance, in an image of a dog carrying a toy, if we are primarily interested in the dog, we would expect the extracted features to emphasize the dog over the toy. In this work, we introduce FocalLens, a conditional visual feature extraction method that produces different representations for the same image based on the context of interest, expressed flexibly through natural language. We leverage vision instruction tuning data and contrastively tune a pretrained vision encoder to take natural language instructions as additional inputs and produce conditional image representations. Extensive experiments validate that conditional image representations from FocalLens better emphasize the visual features of interest compared to the generic features produced by standard vision encoders like CLIP. In addition, we show that FocalLens further leads to performance improvements on a range of downstream tasks including image-image retrieval, image classification, and image-text retrieval, with average gains of 5 and 10 points on the challenging SugarCrepe and MMVP-VLM benchmarks, respectively.
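The submission page itself includes no code, but the training recipe the abstract outlines — contrastively tuning a pretrained vision encoder to accept a natural-language instruction as an extra input — can be illustrated with a minimal PyTorch sketch. All names below (ConditionalImageEncoder, fuse, contrastive_loss) are hypothetical stand-ins, not the paper's actual implementation, and the fusion and loss choices are assumptions for illustration only.

```python
# Hypothetical sketch of contrastive instruction tuning for a
# text-conditioned vision encoder, following the abstract's description.
# Module and function names are illustrative, not from the paper.
import torch
import torch.nn.functional as F
from torch import nn

class ConditionalImageEncoder(nn.Module):
    """Encodes an image conditioned on a natural-language instruction."""
    def __init__(self, vision_encoder: nn.Module, text_encoder: nn.Module, dim: int = 512):
        super().__init__()
        self.vision_encoder = vision_encoder  # e.g., a pretrained CLIP vision tower
        self.text_encoder = text_encoder      # encodes the instruction text
        self.fuse = nn.Linear(2 * dim, dim)   # assumed simple fusion of image + instruction features

    def forward(self, images: torch.Tensor, instructions: torch.Tensor) -> torch.Tensor:
        img = self.vision_encoder(images)        # (B, dim) image features
        cond = self.text_encoder(instructions)   # (B, dim) instruction features
        z = self.fuse(torch.cat([img, cond], dim=-1))
        return F.normalize(z, dim=-1)            # unit-norm conditional embedding

def contrastive_loss(image_emb: torch.Tensor, target_text_emb: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE: each conditioned image embedding should match the
    text describing the aspect of interest, and vice versa."""
    logits = image_emb @ target_text_emb.t() / temperature  # (B, B) similarity matrix
    labels = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))
```

Under this sketch, inference needs no further tuning: passing the same image with different instructions yields different embeddings, which is what would make the conditional representations usable zero-shot for retrieval or classification.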
Primary Area: foundation or frontier models, including LLMs
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 11721