Abstract: With the rapid development of e-commerce platforms and the continuous influx of massive clothing data from sources such as the Internet and mobile apps, clothing has become an indispensable part of everyday consumption. Clothing image retrieval is especially useful when a user is interested in a garment that appears in social media videos (e.g., Instagram, TikTok). This is known as the video-to-shop retrieval task, in which the query clothes worn in a short video and the target clothes in shop images belong to different domains. In this paper, to address this problem, we propose a cross-domain clothing retrieval method based on massive data in the context of e-commerce live streaming. Our method comprises two components: object detection and re-identification (Re-ID). The object detection branch is an anchor-free detector. The Re-ID branch is pre-trained on large-scale image-text pairs and uses prompts to learn a joint embedding space with cross-domain and fine-grained information, improving retrieval performance. The network enhances visual representation by integrating prompt learning and capitalizing on the cross-modal descriptive capability of FashionCLIP. We conducted experiments on the large-scale multimodal Watch and Buy dataset and achieved good results.
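At retrieval time, video-to-shop matching of this kind typically reduces to ranking shop-image embeddings by similarity to a query embedding in the joint space. The sketch below illustrates that ranking step only; the embeddings themselves would come from the pre-trained Re-ID branch (e.g., a FashionCLIP-style encoder), and all function names here are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    # Project embeddings onto the unit sphere so the dot product
    # below equals cosine similarity.
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def retrieve(query_emb, gallery_embs, top_k=5):
    """Rank gallery (shop-image) embeddings by cosine similarity
    to a single query (video-crop) embedding.

    query_emb:    (d,) embedding of the detected clothing crop
    gallery_embs: (n, d) embeddings of shop images
    Returns (indices, similarities) of the top_k matches.
    """
    q = l2_normalize(query_emb)
    g = l2_normalize(gallery_embs)
    sims = g @ q                          # cosine similarity per gallery item
    order = np.argsort(-sims)[:top_k]     # highest similarity first
    return order, sims[order]
```

In a full pipeline, the anchor-free detector would first crop the clothing region from each video frame, and `retrieve` would be called with that crop's embedding against the pre-computed shop gallery.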