Improving Composed Image Retrieval via Contrastive Learning with Scaling Positives and Negatives

Published: 20 Jul 2024, Last Modified: 21 Jul 2024 · MM 2024 Poster · CC BY 4.0
Abstract:

The Composed Image Retrieval (CIR) task aims to retrieve target images using a composed query consisting of a reference image and a modification text. Advanced methods often use contrastive learning as the optimization objective, which benefits from adequate positive and negative examples. However, constructing CIR triplets incurs high manual annotation costs, resulting in limited positive examples. Furthermore, existing methods commonly rely on in-batch negative sampling, which limits the number of negatives available to the model. To address the lack of positives, we propose a data generation method that leverages a multi-modal large language model to construct triplets for CIR. To introduce more negatives during fine-tuning, we design a two-stage fine-tuning framework for CIR, whose second stage introduces a large number of static negative representations to rapidly optimize the representation space. These two improvements can be effectively combined and are designed to be plug-and-play, so they can be applied to existing CIR models without changing their original architectures. Extensive experiments and ablation analyses demonstrate that our method effectively scales positives and negatives and achieves state-of-the-art results on both the FashionIQ and CIRR datasets. In addition, our method also performs well in zero-shot composed image retrieval, providing a new CIR solution for low-resource scenarios. The code is released at https://anonymous.4open.science/r/45F4 and will be publicly available upon acceptance.
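To illustrate how static negative representations can extend standard in-batch contrastive learning, the following PyTorch sketch contrasts each composed query against its in-batch targets plus a bank of pre-computed negative embeddings. This is a minimal illustration of the general idea, not the paper's implementation; the function name, temperature value, and tensor shapes are assumptions.

```python
import torch
import torch.nn.functional as F

def contrastive_loss_with_negative_bank(query_emb, target_emb, neg_bank, temperature=0.07):
    """InfoNCE-style loss (illustrative sketch, not the authors' code).

    query_emb:  (B, D) fused reference-image + modification-text embeddings
    target_emb: (B, D) target-image embeddings (in-batch positives/negatives)
    neg_bank:   (K, D) static negative embeddings, excluded from backprop
    """
    query_emb = F.normalize(query_emb, dim=-1)
    target_emb = F.normalize(target_emb, dim=-1)
    neg_bank = F.normalize(neg_bank.detach(), dim=-1)  # static: no gradient flows into the bank

    # Similarities to in-batch targets (B, B) and to bank negatives (B, K)
    logits_batch = query_emb @ target_emb.t()
    logits_bank = query_emb @ neg_bank.t()
    logits = torch.cat([logits_batch, logits_bank], dim=1) / temperature

    # The i-th query's positive is the i-th in-batch target; everything else is a negative
    labels = torch.arange(query_emb.size(0), device=query_emb.device)
    return F.cross_entropy(logits, labels)

# Example usage with random tensors
B, D, K = 8, 512, 4096
loss = contrastive_loss_with_negative_bank(torch.randn(B, D), torch.randn(B, D), torch.randn(K, D))
```

Because the bank embeddings are detached, the number of negatives K can be made much larger than the batch size at little extra memory or compute cost, which is the motivation for introducing them in a second fine-tuning stage.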

Primary Subject Area: [Content] Vision and Language
Secondary Subject Area: [Engagement] Multimedia Search and Recommendation
Relevance To Conference: The Composed Image Retrieval (CIR) task retrieves target images using a multi-modal query consisting of a reference image and a modification sentence that changes specific attributes of the reference image; it is an important task in multimedia search. Our work introduces a plug-and-play method to enhance contrastive learning for CIR. First, we leverage a multi-modal large language model to generate a large number of positive examples for the CIR task. Second, we use a memory-bank approach to increase the number of negative examples, further improving CIR performance. Through these methods and rigorous experimentation, we advance the state of the art in CIR and contribute novel methodologies and insights to multi-modal information retrieval. By presenting our findings at the MM conference, we aim to engage with fellow researchers, share our contributions, and collaborate on the ongoing evolution of multimedia and multi-modal processing technologies.
Supplementary Material: zip
Submission Number: 244