A fine-grained vision and language representation framework with graph-based fashion semantic knowledge

Published: 01 Jan 2023, Last Modified: 13 Nov 2024
Venue: Comput. Graph. 2023
License: CC BY-SA 4.0
Abstract: Highlights
• This paper proposes a novel framework for fine-grained vision and language representation in the fashion domain.
• Specifically, we construct a knowledge-dependency graph from fashion descriptions and aggregate it with word-level embeddings, which strengthens the fashion semantic knowledge and yields fine-grained textual representations.
• Moreover, we fine-tune a region-aware fashion segmentation network to capture region-level visual features, and then introduce local vision and language contrastive learning to pull the fine-grained textual representations closer to the region-level visual features of the same garment.
• Extensive experiments on downstream tasks, including cross-modal retrieval, category/subcategory recognition, and text-guided image retrieval, demonstrate the superiority of our method over state-of-the-art methods.
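The local vision and language contrastive learning mentioned in the highlights can be illustrated with a small sketch. The abstract does not give the loss formulation, so this assumes a symmetric InfoNCE-style objective over paired region and text features. The function name `local_contrastive_loss`, the temperature of 0.07, and the toy feature vectors are all hypothetical and are not taken from the paper.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def log_softmax(row):
    """Numerically stable log-softmax over a list of scores."""
    top = max(row)
    log_z = top + math.log(sum(math.exp(x - top) for x in row))
    return [x - log_z for x in row]


def local_contrastive_loss(region_feats, text_feats, temperature=0.07):
    """Symmetric InfoNCE-style loss (an assumed form, not the paper's).

    region_feats[i] and text_feats[i] are assumed to describe the same
    garment region; every other pairing in the batch acts as a negative.
    """
    if not region_feats or len(region_feats) != len(text_feats):
        raise ValueError("need the same non-zero number of region and text features")

    regions = [l2_normalize(r) for r in region_feats]
    texts = [l2_normalize(t) for t in text_feats]
    n = len(regions)

    # Cosine similarity between every region and every text, scaled by temperature.
    sim = [
        [sum(a * b for a, b in zip(r, t)) / temperature for t in texts]
        for r in regions
    ]

    # Region-to-text direction: the matching text sits on the diagonal.
    region_to_text = -sum(log_softmax(sim[i])[i] for i in range(n)) / n

    # Text-to-region direction: same computation on the transposed matrix.
    sim_t = [list(col) for col in zip(*sim)]
    text_to_region = -sum(log_softmax(sim_t[i])[i] for i in range(n)) / n

    return 0.5 * (region_to_text + text_to_region)


if __name__ == "__main__":
    # Toy features: two garment regions and their two descriptions.
    regions = [[1.0, 0.0], [0.0, 1.0]]
    aligned_texts = [[1.0, 0.0], [0.0, 1.0]]
    swapped_texts = [[0.0, 1.0], [1.0, 0.0]]
    print(local_contrastive_loss(regions, aligned_texts))
    print(local_contrastive_loss(regions, swapped_texts))
```

Under these assumptions, the loss is lower when each text feature lies close to the region feature of its own garment than when the pairing is swapped, which is the "pulling closer" behaviour the highlights describe.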