Keywords: Vision and Language, Zero-Shot Composed Image Retrieval
Abstract: Composed Image Retrieval (CIR) retrieves a target image that matches a reference image while incorporating specified textual modifications, a capability crucial for internet search and e-commerce. Traditional supervised CIR methods rely on annotated triplets, which are labor-intensive to collect and limit generalizability. Recent advances in Zero-Shot Composed Image Retrieval (ZS-CIR) address the challenge of performing this task without annotated triplets. A key difficulty in ZS-CIR is training models, given limited intention-relevant data, to understand the human intent implicitly expressed in textual modifications so that target images can be retrieved accurately. In this paper, we introduce an image-text dataset augmented with pseudo-manipulation intentions to help ZS-CIR models learn to understand human manipulation intent. Building on this dataset, we propose De-MINDS, a novel framework that captures the intent behind a requested modification, strengthening the ZS-CIR model's understanding of human manipulation descriptions. Specifically, a simple mapping network first projects image information into the language space, where it is combined with the manipulation description to form a target description. De-MINDS then extracts intention-relevant information from the target description and converts it into several pseudo-word tokens for accurate ZS-CIR. De-MINDS generalizes robustly across four ZS-CIR tasks, improving over the previous best methods by 2.05% to 4.35% and establishing new state-of-the-art results at comparable inference time. Our code is available at https://anonymous.4open.science/r/De-MINDS/.
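To make the described pipeline concrete, the following is a minimal PyTorch sketch of the two stages the abstract names: a mapping network that projects image features into the language-token space, and an intent-capture module that distills the combined target description into pseudo-word tokens. The class names (MappingNetwork, IntentCapture), the transformer-based intent module, and all dimensions are illustrative assumptions, not the paper's actual implementation (which is in the linked repository).

import torch
import torch.nn as nn

class MappingNetwork(nn.Module):
    """Projects an image embedding into the language-token space.
    Hypothetical sketch: layer sizes are assumed, not taken from the paper."""
    def __init__(self, image_dim=768, token_dim=512, num_tokens=1):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(image_dim, token_dim),
            nn.ReLU(),
            nn.Linear(token_dim, token_dim * num_tokens),
        )
        self.num_tokens = num_tokens
        self.token_dim = token_dim

    def forward(self, image_emb):
        # (batch, image_dim) -> (batch, num_tokens, token_dim)
        out = self.proj(image_emb)
        return out.view(-1, self.num_tokens, self.token_dim)

class IntentCapture(nn.Module):
    """Extracts intention-relevant information from the target description
    into pseudo-word tokens (a transformer encoder is assumed here)."""
    def __init__(self, token_dim=512, num_pseudo_tokens=3, num_layers=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=token_dim, nhead=8, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        # Learnable query slots that read out the pseudo-word tokens.
        self.queries = nn.Parameter(torch.randn(num_pseudo_tokens, token_dim))

    def forward(self, description_tokens):
        # Prepend the queries, encode jointly with the description,
        # then return the query positions as pseudo-word tokens.
        batch = description_tokens.size(0)
        q = self.queries.unsqueeze(0).expand(batch, -1, -1)
        x = torch.cat([q, description_tokens], dim=1)
        x = self.encoder(x)
        return x[:, : self.queries.size(0)]

# Toy usage with random tensors standing in for CLIP-style features.
mapper = MappingNetwork()
intent = IntentCapture()
image_emb = torch.randn(4, 768)          # reference-image features
manip_tokens = torch.randn(4, 12, 512)   # embedded manipulation description
target_desc = torch.cat([mapper(image_emb), manip_tokens], dim=1)
pseudo_tokens = intent(target_desc)      # (4, 3, 512) pseudo-word tokens

In an actual ZS-CIR system, the pseudo-word tokens would be fed back through the text encoder and scored against candidate-image embeddings (e.g., by cosine similarity); that retrieval step is omitted here.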
Primary Area: Machine vision
Submission Number: 1089