Real20M: A Large-scale E-commerce Dataset for Cross-domain Retrieval

Published: 26 Oct 2023, Last Modified: 29 Sept 2024 | Venue: ACM MM 2023 | Readers: Everyone | License: CC BY 4.0
Abstract: In e-commerce, products and micro-videos serve as two primary carriers. Introducing cross-domain retrieval between these carriers can establish associations, thereby leading to the advancement of specific scenarios, such as retrieving products based on micro-videos or recommending relevant videos based on products. However, existing datasets only focus on retrieval within the product domain while neglecting the micro-video domain and often ignore the multimodal characteristics of the product domain. Additionally, these datasets strictly limit their data scale through content alignment and use a content-based data organization format that hinders the inclusion of user retrieval intentions. To address these limitations, we propose the PKU Real20M dataset, a large-scale e-commerce dataset designed for cross-domain retrieval. We adopt a query-driven approach to efficiently gather over 20 million e-commerce products and micro-videos, including multimodal information. Additionally, we design a three-level entity prompt learning framework to align inter-modality information from coarse to fine. Moreover, we introduce the Query-driven Cross-Domain retrieval framework (QCD), which leverages user queries to facilitate efficient alignment between the product and micro-video domains. Extensive experiments on two downstream tasks validate the effectiveness of our proposed approaches.