Multi-granular Semantic Mining for Composed Image Retrieval

Published: 01 Jan 2024, Last Modified: 13 Nov 2024, Venue: ICME 2024, License: CC BY-SA 4.0
Abstract: Composed Image Retrieval (CIR) aims to model a user's query intention expressed through multiple modalities and to retrieve the desired images from a large image corpus. The main challenge is to integrate the semantic information of the two modalities effectively. A popular solution is to design attention-based modules that extract the query embedding in a coarse manner, which leaves the search intention ambiguous. To address this problem, we propose a new method for query integration composed of two key modules: Multi-granular Subspace Fusion (MSF) and a Residual Regression (RR) constraint. Specifically, MSF mines the cross-modal semantic dependency between reference image regions and modification text pieces in multi-granular subspaces, constructing an implicit, holistic semantic relationship in a fine-grained manner. The RR constraint promotes visual-text semantic alignment under specific supervision. Extensive experiments on three prevalent datasets demonstrate the state-of-the-art performance of our method.
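To make the task setup concrete, the following is a minimal sketch of the generic CIR pipeline the abstract describes: a reference image embedding and a modification text embedding are fused into one query embedding, and corpus images are ranked by cosine similarity to it. This is not the MSF module or the RR constraint, whose details the abstract does not give. The additive fusion in `compose_query` is a placeholder assumption, the function names are hypothetical, and the toy three-dimensional vectors stand in for the outputs of real image and text encoders.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def compose_query(image_emb, text_emb):
    """Fuse the two modalities into a single query embedding.

    ASSUMPTION: element-wise addition is a stand-in fusion chosen for
    illustration only; it is not the paper's Multi-granular Subspace Fusion.
    """
    return l2_normalize([a + b for a, b in zip(image_emb, text_emb)])


def rank_candidates(query_emb, candidate_embs):
    """Return candidate indices sorted by descending cosine similarity."""
    scores = []
    for idx, cand in enumerate(candidate_embs):
        unit = l2_normalize(cand)
        scores.append((sum(q * c for q, c in zip(query_emb, unit)), idx))
    scores.sort(key=lambda s: s[0], reverse=True)
    return [idx for _, idx in scores]


# Toy embeddings (hypothetical; real systems would use learned encoders).
reference_image = [1.0, 0.0, 0.0]    # content of the reference image
modification_text = [0.0, 1.0, 0.0]  # the requested change
corpus = [
    [1.0, 0.0, 0.0],  # same as the reference, change not applied
    [1.0, 1.0, 0.0],  # reference content plus the requested change
    [0.0, 0.0, 1.0],  # unrelated image
]

query = compose_query(reference_image, modification_text)
ranking = rank_candidates(query, corpus)
print(ranking)
```

In this toy setup the candidate that combines the reference content with the requested change ranks first, which is the behaviour a CIR system is meant to produce. A coarse fusion like the one above treats each modality as a single vector, which is the limitation the abstract attributes to existing approaches.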