Contextual Reasoning for Robust Composed Image Retrieval with Vision-Language Models

Published: 01 Jan 2025, Last Modified: 05 Nov 2025 · ICMR 2025 · CC BY-SA 4.0
Abstract: Composed Image Retrieval (CIR) combines a reference image with modification text to enable precise and flexible searches. However, existing methods face two key challenges: first, the limited information in the modification text hampers the model's ability to understand user intent, reducing both accuracy and diversity; second, reliance on unidirectional constraints overlooks the complementary role of reference and target captions. In this paper, we propose CR-CIR, a novel framework that leverages Contextual Reasoning and vision-language models (VLMs) to enhance CIR. Specifically, we use a VLM (e.g., BLIP2) to address the scarcity of textual annotations in existing datasets by generating descriptive captions for both reference and target images. In addition, we enrich the modification text with contextual information using a VLM (e.g., MiniCPM), deepening the model's understanding of user intent. Our method then incorporates a Dual Reasoning Modification Module, which imposes bidirectional constraints by integrating both image and text modalities. We further introduce a Modality Shift Regularization Loss that assumes symmetry and correlation between text- and image-domain transformations in the latent space; by enforcing consistent modality shifts, it significantly enhances the model's interpretative and generalization abilities. Experimental results on benchmark CIR datasets demonstrate that the proposed method achieves state-of-the-art (SOTA) performance. Our code and dataset will be available at https://github.com/kola1124/CR-CIR.git.
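To make the idea behind the Modality Shift Regularization Loss concrete, the sketch below shows one plausible reading of "enforcing consistent modality shifts": the shift from the reference image embedding to the target image embedding should agree with the shift from the reference caption embedding to the target caption embedding in the shared latent space. The abstract does not specify the exact formulation, so the function name, the cosine-based penalty, and the input shapes are all illustrative assumptions, not the paper's actual loss.

```python
import torch
import torch.nn.functional as F


def modality_shift_regularization(ref_img, tgt_img, ref_txt, tgt_txt):
    """Hypothetical sketch of a modality-shift consistency loss.

    Assumes all inputs are embeddings of shape (B, D) in a shared latent
    space: reference/target image features and reference/target caption
    features. The paper's actual formulation is not given in the abstract;
    this simply penalizes disagreement between the image-domain shift
    (target image - reference image) and the text-domain shift
    (target caption - reference caption).
    """
    # Shift induced by the modification in the image domain.
    img_shift = F.normalize(tgt_img - ref_img, dim=-1)
    # Shift induced by the modification in the text domain.
    txt_shift = F.normalize(tgt_txt - ref_txt, dim=-1)
    # Encourage the two shift directions to align (cosine distance).
    return (1.0 - F.cosine_similarity(img_shift, txt_shift, dim=-1)).mean()


if __name__ == "__main__":
    B, D = 8, 256  # toy batch of 8 embeddings with dimension 256
    ref_img, tgt_img, ref_txt, tgt_txt = (torch.randn(B, D) for _ in range(4))
    print(modality_shift_regularization(ref_img, tgt_img, ref_txt, tgt_txt))
```

In practice such a term would be added, with a weighting coefficient, to the main retrieval objective; the weighting and the exact distance measure (cosine vs. L2) are design choices the abstract leaves open.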