Dynamic Network for Language-based Fashion Retrieval

Published: 01 Jan 2023, Last Modified: 13 Jul 2025 · MMIR@MM 2023 · CC BY-SA 4.0
Abstract: Language-based fashion image retrieval, a form of composed image retrieval, presents a substantial challenge in the domain of multi-modal retrieval. The task is to retrieve the target fashion item from a gallery given a reference image and a modification text. Existing approaches primarily concentrate on developing a static multi-modal fusion module to learn the combined semantics of the reference image and modification text. Despite their commendable advancements, these approaches remain limited in flexibility because a single fusion module is applied across diverse input queries. In contrast to static fusion methods, we propose a novel method, termed Dynamic Fusion Network (DFN), that composes multi-granularity features dynamically by considering the consistency of the routing path and modality-specific information simultaneously. Specifically, our proposed method consists of two modules: (1) Dynamic Network, which enables a flexible combination of different operation modules, providing multi-granularity modality interaction for each reference image and modification text; (2) Modality Specific Routers (MSR), which generate precise routing decisions based on the distinct semantics and distributions of each reference image and modification text. Extensive experiments on three benchmarks, i.e., FashionIQ, Shoes, and Fashion200K, demonstrate the effectiveness of our proposed model compared with existing methods.
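To make the routing idea concrete, the following is a minimal toy sketch of dynamic fusion with modality-specific soft routers. It is not the paper's implementation: the operation modules, router weights, and additive fusion here are all hypothetical stand-ins. Each modality's router scores the available operation modules and mixes their outputs by a softmax-weighted sum, so different inputs can emphasize different operations.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    e = [math.exp(x - m) for x in xs]
    s = sum(e)
    return [v / s for v in e]

# Hypothetical operation modules: each maps a feature vector to a new one.
def identity_op(x):
    return list(x)

def scale_op(x):
    return [2.0 * v for v in x]

def shift_op(x):
    return [v + 1.0 for v in x]

OPS = [identity_op, scale_op, shift_op]

def modality_specific_router(feature, weight_rows):
    """One linear score per operation module, then softmax ->
    a soft routing distribution over OPS (the MSR idea: each
    modality has its own router parameters)."""
    logits = [sum(w * v for w, v in zip(row, feature)) for row in weight_rows]
    return softmax(logits)

def dynamic_fusion(image_feat, text_feat, img_router_w, txt_router_w):
    """Route each modality through all operation modules, mix the
    outputs by that modality's routing weights, and fuse by addition
    (the fusion rule here is an illustrative placeholder)."""
    img_w = modality_specific_router(image_feat, img_router_w)
    txt_w = modality_specific_router(text_feat, txt_router_w)
    img_out = [sum(w * o for w, o in zip(img_w, col))
               for col in zip(*(op(image_feat) for op in OPS))]
    txt_out = [sum(w * o for w, o in zip(txt_w, col))
               for col in zip(*(op(text_feat) for op in OPS))]
    return [a + b for a, b in zip(img_out, txt_out)]
```

Because the routing weights are input-dependent, two different query pairs can effectively traverse different combinations of operations, which is what distinguishes this scheme from a single static fusion module.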