Prompting vision-language fusion for Zero-Shot Composed Image Retrieval

Published: 05 Sept 2024, Last Modified: 16 Oct 2024, ACML 2024 Conference Track, CC BY 4.0
Keywords: Composed image retrieval, zero-shot, prompting fusion, vision-language model
TL;DR: We propose a simple yet effective method to prompt vision-language fusion (PVLF) for zero-shot composed image retrieval (ZS-CIR).
Abstract: Composed image retrieval (CIR) aims to retrieve a target image given a query that combines a reference image and a textual description. Recently, building on vision-language pretrained (VLP) models and large language models (LLMs), textual inversion and large-scale dataset generation have emerged as approaches to the zero-shot CIR (ZS-CIR) task. However, existing ZS-CIR models overlook the case in which the textual description is too brief or inherently inaccurate, which makes it difficult to integrate the reference image into the query effectively when retrieving the target image. To address this problem, we propose a simple yet effective method, prompting vision-language fusion (PVLF), which adapts the representations in VLP models to dynamically fuse the vision and language (V&L) representation spaces. In addition, by injecting learnable context prompt tokens into the Transformer fusion encoder, PVLF promotes comprehensive coupling between the V&L modalities and enriches the semantic representation of the query. We evaluate the effectiveness and robustness of our method on various VLP backbones. The experimental results show that PVLF outperforms previous methods and achieves state-of-the-art performance on two public ZS-CIR benchmarks (CIRR and FashionIQ).
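To make the task setup in the abstract concrete, the following is a minimal sketch of the generic CIR retrieval step: a reference-image embedding and a text embedding are fused into one query, and gallery images are ranked by cosine similarity to it. It is not the PVLF method. The names `fuse_query`, `retrieve`, and `alpha`, the weighted-average fusion, and the three-dimensional toy embeddings are all assumptions made for illustration; PVLF replaces the placeholder fusion with a prompted Transformer fusion encoder whose details are not given here.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def fuse_query(image_emb, text_emb, alpha=0.5):
    """Hypothetical placeholder fusion: weighted average of unit-length embeddings.

    This stands in for a learned fusion module; it is NOT the PVLF encoder.
    """
    img = l2_normalize(image_emb)
    txt = l2_normalize(text_emb)
    mixed = [alpha * i + (1.0 - alpha) * t for i, t in zip(img, txt)]
    return l2_normalize(mixed)


def retrieve(query_emb, gallery, top_k=3):
    """Return the names of the top_k gallery images by cosine similarity."""
    q = l2_normalize(query_emb)
    scored = []
    for name, emb in gallery.items():
        g = l2_normalize(emb)
        scored.append((sum(a * b for a, b in zip(q, g)), name))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [name for _, name in scored[:top_k]]


# Toy embeddings (assumed values, chosen only to make the ranking easy to follow).
reference_image = [1.0, 0.0, 0.0]    # what the reference image shows
modification_text = [0.0, 1.0, 0.0]  # what the text asks to change
gallery = {
    "target": [1.0, 1.0, 0.0],             # matches both image and text
    "same_as_reference": [1.0, 0.0, 0.0],  # ignores the text
    "unrelated": [0.0, 0.0, 1.0],
}

query = fuse_query(reference_image, modification_text)
ranking = retrieve(query, gallery, top_k=3)
print(ranking)
```

In this toy setting the image that matches both the reference and the text ranks first. If the text embedding is weak or inaccurate, a fixed fusion like this one has no way to lean more on the reference image, which is the limitation the abstract says PVLF is designed to address.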
Primary Area: Applications (bioinformatics, biomedical informatics, climate science, collaborative filtering, computer vision, healthcare, human activity recognition, information retrieval, natural language processing, social networks, etc.)
Student Author: Yes
Submission Number: 197