Fashion Image Retrieval Based on Multimodal Features Enhancement and Fusion

Yingjin Li, Shufan He, Zhaojing Wang, Jin Huang, Xinrong Hu, Li Li

Published: 08 Mar 2025, Last Modified: 20 Mar 2026 | OpenReview Archive Direct Upload | CC BY-NC-ND 4.0

Abstract: Fashion image retrieval (FIR) is of great interest because it can make online shopping more convenient. The exponential growth of online clothing image data has brought a corresponding rise in images of similar styles, which challenges the accuracy of traditional FIR methods. To address this, we present a novel FIR framework built on a multimodal feature enhancement and fusion model, SE-ConvNeXt-Text. In our approach, the feature extraction part of the ConvNeXt network serves as the image feature extraction module, and a Squeeze-and-Excitation (SE) attention mechanism is added to address insufficient feature extraction for similar styles. A text feature processing module supplements the image features by extracting text information on the image. Our method fuses the multimodal information from images and texts through a purpose-designed contrastive learning module to improve FIR accuracy. To validate the approach, we conducted experiments on two sub-tasks (the In-Shop and Extended In-Shop datasets). The results show average improvements of 3.76% in accuracy and 3.51% in precision over state-of-the-art methods.
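
The Squeeze-and-Excitation mechanism named in the abstract can be illustrated with a minimal NumPy sketch. This is the generic SE operation (squeeze by global average pooling, excite with a two-layer bottleneck, then rescale channels), not the authors' implementation. The function name `se_block`, the reduction ratio `r = 4`, the tensor sizes, and the random weights are all assumptions made for illustration; the abstract does not state where the SE block sits inside ConvNeXt or how it is configured.

```python
import numpy as np

def se_block(x, w1, b1, w2, b2):
    """Generic Squeeze-and-Excitation over a feature map x of shape (C, H, W).

    Hypothetical helper for illustration; weights would normally be learned.
    """
    # Squeeze: global average pooling yields one descriptor per channel.
    z = x.mean(axis=(1, 2))                       # shape (C,)
    # Excitation: bottleneck layer with ReLU, then a sigmoid gate per channel.
    h = np.maximum(0.0, w1 @ z + b1)              # shape (C // r,)
    s = 1.0 / (1.0 + np.exp(-(w2 @ h + b2)))      # shape (C,), values in (0, 1)
    # Scale: reweight every channel of the input by its gate.
    return x * s[:, None, None], s

# Assumed sizes: 8 channels, 4x4 spatial map, reduction ratio 4.
rng = np.random.default_rng(0)
C, H, W, r = 8, 4, 4, 4
x = rng.standard_normal((C, H, W))
w1 = rng.standard_normal((C // r, C)) * 0.1
b1 = np.zeros(C // r)
w2 = rng.standard_normal((C, C // r)) * 0.1
b2 = np.zeros(C)

y, gates = se_block(x, w1, b1, w2, b2)
print(y.shape, gates.round(3))
```

The output has the same shape as the input, and each channel is multiplied by a single gate in (0, 1), which is how SE lets a network emphasize the channels that help separate visually similar styles.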