A fine-grained vision and language representation framework with graph-based fashion semantic knowledge
Highlights
• This paper proposes a novel framework for fine-grained vision and language representation in the fashion domain.
• Specifically, we construct a knowledge-dependency graph from fashion descriptions and aggregate it with word-level embeddings, which strengthens fashion semantic knowledge and yields fine-grained textual representations.
• Moreover, we fine-tune a region-aware fashion segmentation network to capture region-level visual features, and introduce local vision-and-language contrastive learning that pulls the fine-grained textual representations closer to the region-level visual features of the same garment.
• Extensive experiments on downstream tasks, including cross-modal retrieval, category/subcategory recognition, and text-guided image retrieval, demonstrate that our method outperforms state-of-the-art methods.
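The local contrastive objective described above, pulling matched region-level visual features and fine-grained textual features together while pushing mismatched pairs apart, is commonly instantiated as a symmetric InfoNCE loss. The sketch below is a minimal, generic illustration of that idea, not the authors' exact formulation; the function name, temperature value, and pairing convention (matched pairs share a row index) are assumptions for illustration.

```python
import numpy as np

def local_contrastive_loss(region_feats, text_feats, temperature=0.07):
    """Symmetric InfoNCE-style loss between region-level visual features
    and fine-grained textual features; matched (same-garment) pairs are
    assumed to share the same row index."""
    # L2-normalize so the dot product is cosine similarity
    v = region_feats / np.linalg.norm(region_feats, axis=1, keepdims=True)
    t = text_feats / np.linalg.norm(text_feats, axis=1, keepdims=True)
    logits = v @ t.T / temperature      # similarity logits, positives on diagonal
    idx = np.arange(len(v))

    def cross_entropy(l):
        # numerically stable log-softmax over each row
        l = l - l.max(axis=1, keepdims=True)
        log_prob = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_prob[idx, idx].mean()

    # average the vision-to-text and text-to-vision directions
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))
```

With perfectly matched features the loss approaches zero, while unrelated pairs yield a loss near log(batch size), so minimizing it aligns the two modalities at the region/phrase level.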