Abstract: Composed Image Retrieval (CIR) aims to model a user's query intention with multiple modalities and retrieve the desired images from a large image corpus. The central challenge is how to effectively integrate the semantic information of the two modalities. A popular solution is to design attention-based modules that extract the query embedding in a coarse manner, which blurs the search intention. To address this problem, we propose a new method for query integration composed of two key modules, i.e., Multi-granular Subspace Fusion (MSF) and a Residual Regression (RR) constraint. Specifically, MSF mines cross-modal semantic dependencies between reference-image regions and modification-text pieces in multi-granular subspaces, constructing an implicit, holistic semantic relationship at a fine granularity. The RR constraint further enforces visual-text semantic alignment under specific supervision. Extensive experiments on three prevalent datasets demonstrate the state-of-the-art performance of our method.
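To make the fusion idea concrete, below is a minimal PyTorch-style sketch of multi-granular subspace fusion as described in the abstract: image-region and text-token features are projected into several learned subspaces, cross-attention is computed in each subspace, and the per-subspace results are aggregated into a single query embedding. The module name, dimensions, number of subspaces, and pooling choice are illustrative assumptions, not the authors' implementation.

```python
# A sketch of the Multi-granular Subspace Fusion (MSF) idea, assuming PyTorch.
# All hyperparameters and names here are hypothetical, not the paper's code.
import torch
import torch.nn as nn


class MultiGranularSubspaceFusion(nn.Module):
    """Fuses reference-image region features with modification-text token
    features by cross-attending in several learned subspaces, then
    aggregating the per-subspace results into one query embedding."""

    def __init__(self, dim=512, num_subspaces=4, sub_dim=128):
        super().__init__()
        # One projection pair per subspace: regions -> subspace, tokens -> subspace.
        self.img_proj = nn.ModuleList(
            nn.Linear(dim, sub_dim) for _ in range(num_subspaces))
        self.txt_proj = nn.ModuleList(
            nn.Linear(dim, sub_dim) for _ in range(num_subspaces))
        self.out = nn.Linear(num_subspaces * sub_dim, dim)

    def forward(self, regions, tokens):
        # regions: (B, R, dim) reference-image region features
        # tokens:  (B, T, dim) modification-text token features
        fused = []
        for ip, tp in zip(self.img_proj, self.txt_proj):
            q = ip(regions)                                  # (B, R, sub_dim)
            k = tp(tokens)                                   # (B, T, sub_dim)
            # Region-to-token cross-attention within this subspace.
            attn = torch.softmax(
                q @ k.transpose(1, 2) / k.size(-1) ** 0.5, dim=-1)  # (B, R, T)
            fused.append(attn @ k)                           # text-conditioned regions
        # Concatenate the per-subspace views, project back, and pool over
        # regions to obtain a single composed query embedding.
        fused = torch.cat(fused, dim=-1)                     # (B, R, num_subspaces*sub_dim)
        return self.out(fused).mean(dim=1)                   # (B, dim)
```

In this reading, each subspace captures dependencies at a different granularity, and the concatenation lets the final embedding draw on all of them at once; the actual fusion and the RR supervision are specified in the paper itself.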