LDRE: LLM-based Divergent Reasoning and Ensemble for Zero-Shot Composed Image Retrieval

Published: 01 Jan 2024, Last Modified: 13 Nov 2024, Venue: SIGIR 2024, License: CC BY-SA 4.0
Abstract: Zero-Shot Composed Image Retrieval (ZS-CIR), which has garnered increasing interest in recent years, aims to retrieve a target image based on a query composed of a reference image and a modification text, without training samples. Specifically, the modification text describes the difference between the two images. To conduct ZS-CIR, the prevailing methods employ pre-trained image-to-text models to transform the query image and text into a single text, which is then projected into the common feature space by CLIP to retrieve the target image. However, these methods neglect that ZS-CIR is a typical fuzzy retrieval task, where the semantics of the target image are not strictly defined by the query image and text. To overcome this limitation, this paper proposes a training-free LLM-based Divergent Reasoning and Ensemble (LDRE) method for ZS-CIR that captures the diverse possible semantics of the composed result. First, we employ a pre-trained captioning model to generate dense captions for the reference image, each focusing on a different semantic perspective of that image. Then, we prompt Large Language Models (LLMs) to conduct divergent compositional reasoning based on the dense captions and the modification text, deriving divergent edited captions that cover the possible semantics of the composed target. Finally, we design a divergent caption ensemble that produces an ensemble caption feature weighted by semantic relevance scores, which is then used to retrieve the target image in the CLIP feature space. Extensive experiments on three public datasets demonstrate that our proposed LDRE achieves new state-of-the-art performance.
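The final ensemble-and-retrieve step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say how the relevance scores are turned into weights, so the softmax weighting, the `temperature` parameter, and all function names below are assumptions, and the small hand-written vectors stand in for real CLIP text and image features.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def softmax(scores, temperature=1.0):
    """Turn relevance scores into weights that sum to 1 (assumed weighting scheme)."""
    m = max(scores)
    exps = [math.exp((s - m) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def ensemble_caption_feature(caption_feats, relevance_scores, temperature=1.0):
    """Fuse the features of several edited captions into one query feature."""
    weights = softmax(relevance_scores, temperature)
    dim = len(caption_feats[0])
    fused = [0.0] * dim
    for w, feat in zip(weights, caption_feats):
        unit = l2_normalize(feat)
        for i in range(dim):
            fused[i] += w * unit[i]
    return l2_normalize(fused)


def rank_gallery(query_feat, gallery_feats):
    """Return gallery indices sorted by descending cosine similarity to the query."""
    sims = [
        sum(q * g for q, g in zip(query_feat, l2_normalize(feat)))
        for feat in gallery_feats
    ]
    return sorted(range(len(gallery_feats)), key=lambda i: sims[i], reverse=True)


# Toy stand-ins for CLIP features (hypothetical values, 3 dimensions).
caption_feats = [[1.0, 0.0, 0.0], [0.8, 0.6, 0.0], [0.0, 1.0, 0.0]]
relevance_scores = [2.0, 1.0, 0.1]  # hypothetical semantic relevance scores
gallery_feats = [[0.0, 0.0, 1.0], [0.9, 0.3, 0.0], [0.0, 1.0, 0.0]]

query = ensemble_caption_feature(caption_feats, relevance_scores)
ranking = rank_gallery(query, gallery_feats)
print(ranking)
```

In this toy setup the most relevant caption dominates the fused query, but the lower-weighted captions still pull it toward their own semantics, which is the intuition behind treating ZS-CIR as a fuzzy retrieval task.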
