Abstract: Zero-Shot Composed Image Retrieval (ZS-CIR) has attracted growing attention in recent years. The task is to retrieve a target image from a query composed of a reference image and a relative text, without training samples; the relative text describes the differences between the reference and target images. Prevailing ZS-CIR methods employ image-to-text (I2T) models to convert the reference image into a single caption, which is then merged with the relative text by text-fusion approaches to form a composed text for retrieval. However, these methods neglect that ZS-CIR requires considering not only the final similarity between the composed text and the retrieved images but also the semantic increment introduced during compositional editing. To address this limitation, this paper proposes a training-free method called Semantic Editing Increment for ZS-CIR (SEIZE), which retrieves the target image from the query image and text. First, we employ a pre-trained captioning model to generate diverse captions for the reference image and prompt Large Language Models (LLMs) to perform breadth compositional reasoning over these captions and the relative text, thereby covering the potential semantics of the target image. Then, we design a semantic editing search that incorporates the semantic editing increment contributed by the relative text into the retrieval process. Concretely, we combine the relative semantic increment and the absolute similarity into the final retrieval score, which is then used to retrieve the target image in the CLIP feature space. Extensive experiments on three public datasets demonstrate that SEIZE achieves new state-of-the-art performance. The code is publicly available at https://anonymous.4open.science/r/SEIZE-11BC.
Primary Subject Area: [Content] Vision and Language
Secondary Subject Area: [Content] Multimodal Fusion, [Engagement] Multimedia Search and Recommendation, [Experience] Multimedia Applications
Relevance To Conference: Zero-Shot Composed Image Retrieval (ZS-CIR) is a typical multi-modal retrieval task: it retrieves a target image from a query composed of a reference image and a relative text, without training samples. In this work, we propose a novel method, Semantic Editing Increment for ZS-CIR (SEIZE), which achieves state-of-the-art performance on this task.
Supplementary Material: zip
Submission Number: 5305
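The abstract describes a final retrieval score that combines absolute similarity with a relative semantic increment, but it does not give the formula. The sketch below is one possible reading, not the authors' implementation: it assumes the increment is the change in image similarity when moving from the reference captions to the LLM-composed texts, and that the two terms are combined linearly. The function names, the `weight` parameter, and the averaging of multiple captions are all hypothetical, and plain NumPy arrays stand in for CLIP text and image features.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products are cosine similarities."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)


def increment_aware_scores(caption_feats, composed_feats, image_feats, weight=1.0):
    """Score candidate images (hypothetical formulation, see lead-in).

    caption_feats:  (n_captions, d) text features of the reference-image captions
    composed_feats: (n_composed, d) text features of the composed (edited) texts
    image_feats:    (n_images, d)   features of the candidate images
    weight:         assumed trade-off between increment and absolute similarity
    """
    caption_feats = l2_normalize(np.asarray(caption_feats, dtype=np.float64))
    composed_feats = l2_normalize(np.asarray(composed_feats, dtype=np.float64))
    image_feats = l2_normalize(np.asarray(image_feats, dtype=np.float64))

    # Assumption: diverse captions are pooled by averaging, then renormalized.
    caption_query = l2_normalize(caption_feats.mean(axis=0))
    composed_query = l2_normalize(composed_feats.mean(axis=0))

    before = image_feats @ caption_query     # similarity before the edit
    absolute = image_feats @ composed_query  # similarity after the edit
    increment = absolute - before            # gain attributed to the relative text

    return absolute + weight * increment


def retrieve(scores, k=1):
    """Return the indices of the k highest-scoring images."""
    return np.argsort(-scores)[:k]
```

Under these assumptions, setting `weight=0` reduces the score to plain text-to-image similarity, while a positive `weight` down-ranks candidates that merely resemble the reference image and favors those whose similarity rises because of the edit.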