FashionERN: Enhance-and-Refine Network for Composed Fashion Image Retrieval

Published: 24 Mar 2024 · Last Modified: 29 Sept 2024 · AAAI 2024 · CC BY 4.0
Abstract: The goal of composed fashion image retrieval is to locate a target image based on a reference image and modified text. Recent methods utilize symmetric encoders (e.g., CLIP) pre-trained on large-scale non-fashion datasets. However, the input to this task is asymmetric: the reference image contains rich content, while the modified text is often brief. Methods employing symmetric encoders therefore suffer from a severe problem: retrieval results are dominated by the reference image, and the modified text is overlooked. We propose a Fashion Enhance-and-Refine Network (**FashionERN**) centered around two aspects: enhancing the text encoder and refining visual semantics. We introduce a Triple-branch Modifier Enhancement model, which injects relevant information from the reference image and aligns the modified-text modality with the target-image modality. Furthermore, we propose a Dual-guided Vision Refinement model that retains critical visual information through text-guided and self-guided refinement processes. The combination of these two models significantly mitigates the reference-dominance phenomenon and ensures accurate fulfillment of the modifier's requirements. Comprehensive experiments demonstrate our approach's state-of-the-art performance on four commonly used datasets.
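The abstract only names the two components, so the following is a minimal PyTorch sketch of the general enhance-and-refine pattern it describes: inject reference-image content into the brief text embedding, gate the image features by the text, and score target candidates against the fused query. All module names, dimensions, and fusion choices here (`ModifierEnhancement`, `VisionRefinement`, the concatenate-and-gate design) are hypothetical illustrations, not the paper's actual architecture.

```python
# Illustrative sketch of the enhance-and-refine idea from the abstract.
# Every module name, dimension, and fusion choice below is an assumption.
import torch
import torch.nn as nn
import torch.nn.functional as F

DIM = 512  # assumed shared embedding dimension of the image/text encoders


class ModifierEnhancement(nn.Module):
    """Enhance the (brief) text embedding by injecting reference-image content."""

    def __init__(self, dim: int = DIM):
        super().__init__()
        self.inject = nn.Sequential(
            nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, dim)
        )

    def forward(self, text_emb: torch.Tensor, ref_img_emb: torch.Tensor) -> torch.Tensor:
        # Concatenate text and reference-image embeddings and project back,
        # so the short modifier text carries relevant visual context.
        return self.inject(torch.cat([text_emb, ref_img_emb], dim=-1))


class VisionRefinement(nn.Module):
    """Refine reference-image features, keeping only text-relevant content."""

    def __init__(self, dim: int = DIM):
        super().__init__()
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, ref_img_emb: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
        # Text-guided gating: suppress visual content the modifier does not
        # ask for, counteracting retrieval dominated by the reference image.
        g = torch.sigmoid(self.gate(torch.cat([ref_img_emb, text_emb], dim=-1)))
        return g * ref_img_emb


def retrieve(ref_img_emb, text_emb, candidate_embs, enhancer, refiner):
    """Score target candidates against the fused enhanced-text + refined-image query."""
    query = enhancer(text_emb, ref_img_emb) + refiner(ref_img_emb, text_emb)
    query = F.normalize(query, dim=-1)
    candidates = F.normalize(candidate_embs, dim=-1)
    return query @ candidates.t()  # cosine similarities, shape (batch, num_candidates)


if __name__ == "__main__":
    enhancer, refiner = ModifierEnhancement(), VisionRefinement()
    ref, txt = torch.randn(2, DIM), torch.randn(2, DIM)  # stand-ins for encoder outputs
    gallery = torch.randn(100, DIM)                      # target-image embedding gallery
    scores = retrieve(ref, txt, gallery, enhancer, refiner)
    print(scores.argmax(dim=-1))  # index of the best-matching target per query
```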