Abstract: The goal of product mapping is to decide whether two listings from two different e-shops describe the same product. Existing datasets of matching and non-matching product pairs, however, often suffer from incomplete product information or contain only very distant non-matching products. In this paper, we introduce two new datasets for product mapping: ProMapCz, consisting of 1,495 Czech product pairs, and ProMapEn, consisting of 1,555 English product pairs. Both contain matching and non-matching products manually scraped from two pairs of e-shops. The datasets contain both images and textual descriptions of the products, including their specifications, making them among the most complete datasets for product mapping. Additionally, we divide the non-matching products into two categories, close non-matches and medium non-matches, based on how similar the products are to each other. Even the medium non-matches, however, are pairs of products that are much more similar than the non-matches in other datasets: for example, they must still share the same brand and have a similar name and price. Finally, we train a number of product matching models on these datasets to demonstrate how having these two types of non-matches benefits the analysis of these models.