A fine-grained vision and language representation framework with graph-based fashion semantic knowledge

Huiming Ding; Sen Wang; Zhifeng Xie; Mengtian Li; Lizhuang Ma

Published: 01 Jan 2023, Last Modified: 31 Jul 2025, Comput. Graph. 2023, CC BY-SA 4.0

Abstract: Highlights
• This paper proposes a novel framework for fine-grained vision and language representation in the fashion domain.
• Specifically, we construct a knowledge-dependency graph from fashion descriptions and aggregate it with word-level embeddings, which strengthens fashion semantic knowledge and yields fine-grained textual representations.
• Moreover, we fine-tune a region-aware fashion segmentation network to capture region-level visual features, and then introduce local vision and language contrastive learning to pull the fine-grained textual representations closer to the region-level visual features of the same garment.
• Extensive experiments on downstream tasks, including cross-modal retrieval, category/subcategory recognition, and text-guided image retrieval, show that our method outperforms state-of-the-art methods.
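To make the idea of aggregating a dependency graph with word-level embeddings concrete, here is a minimal sketch. It is not the authors' implementation: it assumes the graph is given as (head, dependent) word-index pairs from some dependency parser, treats edges as undirected, and uses a simple residual mean-of-neighbours update. The function names `build_adjacency` and `aggregate_graph` are hypothetical.

```python
from typing import List, Tuple

Vector = List[float]


def build_adjacency(num_words: int, edges: List[Tuple[int, int]]) -> List[List[int]]:
    """Build undirected adjacency lists from (head, dependent) index pairs (assumed input format)."""
    adj: List[List[int]] = [[] for _ in range(num_words)]
    for head, dep in edges:
        adj[head].append(dep)
        adj[dep].append(head)
    return adj


def aggregate_graph(embeddings: List[Vector], edges: List[Tuple[int, int]]) -> List[Vector]:
    """Hypothetical aggregation: each word embedding plus the mean of its graph neighbours.

    Words without neighbours keep their original embedding unchanged.
    """
    adj = build_adjacency(len(embeddings), edges)
    dim = len(embeddings[0])
    out: List[Vector] = []
    for i, emb in enumerate(embeddings):
        neighbours = adj[i]
        if neighbours:
            mean = [sum(embeddings[j][k] for j in neighbours) / len(neighbours) for k in range(dim)]
        else:
            mean = [0.0] * dim
        # Residual update: keep the word's own meaning, enrich it with its dependents/heads.
        out.append([emb[k] + mean[k] for k in range(dim)])
    return out


if __name__ == "__main__":
    # Toy description "red floral dress": "dress" (index 2) heads "red" (0) and "floral" (1).
    words = ["red", "floral", "dress"]
    embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    edges = [(2, 0), (2, 1)]
    for word, vec in zip(words, aggregate_graph(embeddings, edges)):
        print(word, vec)
```

In this toy case "dress" absorbs information from both of its modifiers, while each modifier is pulled toward the garment noun it describes; a learned model would replace the plain mean with trainable weights.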
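Local vision and language contrastive learning can be illustrated with a standard symmetric InfoNCE loss. This is a minimal sketch under stated assumptions, not the paper's exact objective: it assumes row `i` of the text features and row `i` of the region features come from the same garment (the positive pair), treats all other rows as negatives, and uses an assumed temperature of 0.07. The function name `local_contrastive_loss` is hypothetical.

```python
import math
from typing import List

Vector = List[float]


def normalize(v: Vector) -> Vector:
    """L2-normalize a vector so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def _cross_entropy_diagonal(logits: List[List[float]]) -> float:
    """Mean cross-entropy where the correct class of row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for a numerically stable log-sum-exp
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def local_contrastive_loss(text: List[Vector], regions: List[Vector], temperature: float = 0.07) -> float:
    """Symmetric InfoNCE between paired text and region features (assumed pairing by index)."""
    t = [normalize(v) for v in text]
    r = [normalize(v) for v in regions]
    logits = [[sum(a * b for a, b in zip(ti, rj)) / temperature for rj in r] for ti in t]
    transposed = [list(col) for col in zip(*logits)]
    # Average the text-to-region and region-to-text directions.
    return 0.5 * (_cross_entropy_diagonal(logits) + _cross_entropy_diagonal(transposed))


if __name__ == "__main__":
    text_feats = [[1.0, 0.0], [0.0, 1.0]]      # e.g. "collar" phrase, "sleeve" phrase
    region_feats = [[0.9, 0.1], [0.1, 0.9]]    # matching collar and sleeve regions
    print(local_contrastive_loss(text_feats, region_feats))
    print(local_contrastive_loss(text_feats, region_feats[::-1]))
```

Minimizing this loss raises the similarity of each phrase to its own region relative to the other regions in the batch, which matches the "pulling closer" behaviour described in the third highlight; mismatched pairings, as in the second call, give a much larger loss.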
