Keywords: Composed Image Retrieval, information retrieval
Abstract: Zero-Shot Composed Image Retrieval (ZS-CIR) aims to retrieve a target image based on a reference image and a modification text, without requiring task-specific training. Most existing methods directly rewrite the query from the multimodal inputs without verification or self-correction, so an initial misinterpretation of the user's intent is unrecoverable and leads to retrieval failure. To address this limitation, we propose CoRR, a novel training-free framework that reframes ZS-CIR as a dynamic, self-correcting process. In contrast to prior methods, CoRR incorporates evidence from retrieved results as explicit feedback and employs a Multimodal Large Language Model (MLLM) to iteratively refine query representations through Chain-of-Thought reasoning. To ensure stable query evolution, we employ Spherical Linear Interpolation (Slerp) to fuse the historical and newly generated query representations. Furthermore, we introduce Retrieval-Driven Caption Optimization, which supplies the MLLM with high-fidelity contextual examples to enhance its reasoning and to align its outputs with the preferences of the embedding space. Extensive experiments on multiple benchmarks, including CIRCO, CIRR, and FashionIQ, demonstrate that CoRR significantly outperforms existing state-of-the-art methods, establishing the effectiveness of the proposed paradigm.
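The Slerp fusion mentioned above can be sketched as follows. This is a minimal illustration of spherical linear interpolation between a historical and a newly generated query embedding; the function name, the interpolation weight `t`, and the unit-normalization step are assumptions for illustration, not the paper's exact fusion rule.

```python
import numpy as np

def slerp(q_old: np.ndarray, q_new: np.ndarray, t: float) -> np.ndarray:
    """Spherical linear interpolation between two embedding vectors.

    t = 0 returns the (normalized) historical query q_old,
    t = 1 returns the (normalized) new query q_new.
    """
    # Normalize so interpolation happens on the unit hypersphere.
    q_old = q_old / np.linalg.norm(q_old)
    q_new = q_new / np.linalg.norm(q_new)

    # Angle between the two unit vectors (clip guards against
    # floating-point values slightly outside [-1, 1]).
    dot = np.clip(np.dot(q_old, q_new), -1.0, 1.0)
    theta = np.arccos(dot)

    if theta < 1e-6:
        # Nearly parallel vectors: fall back to a linear mix.
        return (1.0 - t) * q_old + t * q_new

    # Standard Slerp formula.
    return (np.sin((1.0 - t) * theta) * q_old
            + np.sin(t * theta) * q_new) / np.sin(theta)
```

Because both inputs are normalized, the result stays on the unit hypersphere, which keeps the fused query compatible with cosine-similarity retrieval in the embedding space.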
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 4328