Image Retrieval with Composed Query by Multi-Scale Multi-Modal Fusion

Published: 01 Jan 2024, Last Modified: 15 May 2025, Venue: ICASSP 2024, License: CC BY-SA 4.0
Abstract: Image retrieval with composed query (IR-CQ) is a challenging task: the goal is to retrieve a target image given a hybrid-modality query consisting of a reference image and a text modifier. Previous approaches mainly focus on designing multi-modal fusion modules to fuse the hybrid-modality query, but these modules are often suboptimal because they do not fuse the two modalities sufficiently. In this paper, we propose a general fusion block that combines three fusion strategies: weighted summing, concatenation, and bilinear pooling. Importantly, this general fusion block can be used to fuse not only the hybrid-modality query but also the multi-scale features of the reference image. Specifically, we first fuse the multi-scale features of the reference image with the Multi-Scale Fusion (MSF) block and then fuse the features of the reference image and the text modifier with the Multi-Modal Fusion (MMF) block, where both MSF and MMF are instantiations of our general fusion block. Extensive experiments on three benchmark datasets show that our proposed model significantly outperforms existing approaches.
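The two-stage pipeline described in the abstract can be illustrated with a minimal sketch. The abstract does not say how the three strategies are combined, so everything below is an assumption made for illustration: the class name `FusionBlock`, the helper `compose_query`, the fixed weight `alpha`, the random untrained weights, and the choice to sum the three strategy outputs are all hypothetical. Bilinear pooling is replaced here by a low-rank form (element-wise product of two linear projections), which is a common substitute and not necessarily the paper's formulation.

```python
import random


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def rand_matrix(rows, cols, rng):
    """Small random matrix standing in for learned weights."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


class FusionBlock:
    """Hypothetical general fusion block (not the authors' implementation)."""

    def __init__(self, dim, alpha=0.5, seed=0):
        rng = random.Random(seed)
        self.dim = dim
        self.alpha = alpha  # assumed fixed; a real model would learn it
        self.W_cat = rand_matrix(dim, 2 * dim, rng)  # projects [a; b] to dim
        self.U = rand_matrix(dim, dim, rng)          # projection of a
        self.V = rand_matrix(dim, dim, rng)          # projection of b

    def weighted_sum(self, a, b):
        return [self.alpha * x + (1 - self.alpha) * y for x, y in zip(a, b)]

    def concat(self, a, b):
        # Concatenate the two vectors, then project back to `dim`.
        return matvec(self.W_cat, a + b)

    def bilinear(self, a, b):
        # Low-rank stand-in for bilinear pooling: (U a) * (V b), element-wise.
        pa, pb = matvec(self.U, a), matvec(self.V, b)
        return [x * y for x, y in zip(pa, pb)]

    def __call__(self, a, b):
        if len(a) != self.dim or len(b) != self.dim:
            raise ValueError("both inputs must have length dim")
        parts = (self.weighted_sum(a, b), self.concat(a, b), self.bilinear(a, b))
        # Assumption: the three strategy outputs are simply summed.
        return [sum(vals) for vals in zip(*parts)]


def compose_query(coarse, fine, text, msf, mmf):
    """Fuse two image scales (MSF), then fuse the result with the text (MMF)."""
    image = msf(coarse, fine)
    return mmf(image, text)


if __name__ == "__main__":
    dim = 4
    rng = random.Random(1)
    coarse = [rng.random() for _ in range(dim)]  # stand-in coarse-scale feature
    fine = [rng.random() for _ in range(dim)]    # stand-in fine-scale feature
    text = [rng.random() for _ in range(dim)]    # stand-in text-modifier feature
    msf = FusionBlock(dim, seed=2)  # Multi-Scale Fusion instance
    mmf = FusionBlock(dim, seed=3)  # Multi-Modal Fusion instance
    print(compose_query(coarse, fine, text, msf, mmf))
```

The point of the sketch is only that MSF and MMF are two instances of the same block with separate parameters, applied in sequence; the feature extractors, the training objective, and the handling of more than two scales are outside what the abstract specifies.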