Keywords: Multimodal Learning, Composed Image Retrieval, Large Vision Language Models
Abstract: Composed Image Retrieval (CIR) requires retrieving a target image based on a composed query consisting of an image and accompanying text that modifies or instructs changes to the visual reference. This task is particularly challenging, as it demands that the model effectively follow modification instructions for accurate retrieval. Additionally, data acquisition difficulties hinder training models for specific tasks. To address these challenges, recent approaches explore Zero-Shot CIR (ZS-CIR), mainly leveraging CLIP-based models with tailored projections to compose images and textual modifications. However, these base models are not trained on instruction-aware data, limiting their ability to effectively combine visual and textual cues. In this paper, we propose a novel embedding method utilizing an instruction-tuned Multimodal Large Language Model (MLLM) to generate unified embeddings that seamlessly integrate images and modification instructions. Instruction-tuned MLLMs inherently align vision and text while exhibiting strong instruction-following capabilities, though they are primarily used for text generation. We introduce a two-stage training strategy that efficiently transforms the MLLM's text generation capabilities into embedding extraction and further refines its ability to follow modification instructions in CIR. Our model demonstrates significant advancements in ZS-CIR, outperforming state-of-the-art baselines across four public datasets: FashionIQ, CIRR, GeneCIS, and CIRCO. Our model highlights the potential of instruction-tuned MLLMs in capturing nuanced instruction comprehension and advancing CIR systems.
Supplementary Material: pdf
Primary Area: applications to computer vision, audio, language, and other modalities
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 11333