CLIP-Based Composed Image Retrieval with Comprehensive Fusion and Data Augmentation

Published: 01 Jan 2023, Last Modified: 19 Mar 2024 · AI (1) 2023
Abstract: Composed image retrieval (CIR) is a challenging task in which the input query consists of a reference image and its corresponding modification text. Recent methods harness visual-language pre-training models such as CLIP, yielding commendable performance in CIR. Despite their promise, several shortcomings remain. First, a salient domain discrepancy between CLIP's pre-training data and CIR's training data leads to suboptimal feature representations. Second, existing multimodal fusion mechanisms rely solely on weighted summation and feature concatenation, neglecting the intricate higher-order interactions inherent in the multimodal query; this oversight makes it difficult to model complex modification intents. Additionally, the paucity of data impedes model generalization. To address these issues, we propose a CLIP-based composed image retrieval model with comprehensive fusion and data augmentation (CLIP-CD), consisting of two training stages. In the first stage, we fine-tune both the image and text encoders of CLIP to alleviate the aforementioned domain discrepancy. In the second stage, we propose a comprehensive multimodal fusion module that enables the model to discern complex modification intentions. Furthermore, we propose a similarity-based data augmentation method for CIR, ameliorating data scarcity and enhancing the model's generalization ability. Experimental results on the Fashion-IQ dataset demonstrate the effectiveness of our method.
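The abstract contrasts simple first-order fusion (weighted sum, concatenation) with fusion that also captures higher-order interactions between the reference-image and modification-text features. The sketch below is not the paper's actual module; it is a minimal illustration, under assumed names and dimensions (512-d CLIP-sized features), of how elementwise-product and difference terms can be combined with the usual first-order terms:

```python
import numpy as np

def comprehensive_fusion(img_feat: np.ndarray, txt_feat: np.ndarray,
                         alpha: float = 0.5) -> np.ndarray:
    """Fuse image and text query features (hypothetical sketch).

    Combines a first-order weighted sum with two higher-order
    interaction terms: the elementwise (Hadamard) product and the
    elementwise difference. A real module would typically project
    the result back to the embedding dimension with a learned layer.
    """
    weighted = alpha * img_feat + (1.0 - alpha) * txt_feat  # first-order
    hadamard = img_feat * txt_feat                          # higher-order
    diff = img_feat - txt_feat                              # higher-order
    return np.concatenate([weighted, hadamard, diff], axis=-1)

# Example: a batch of 2 queries with CLIP-sized 512-d features.
img = np.random.randn(2, 512)
txt = np.random.randn(2, 512)
fused = comprehensive_fusion(img, txt)
print(fused.shape)  # (2, 1536)
```

In a trained model, `alpha` would be a learnable parameter and the concatenated vector would pass through a projection layer before being matched against target-image embeddings.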