PROCAT: Product Catalogue Dataset for Implicit Clustering, Permutation Learning and Structure Prediction

Published: 29 Jul 2021, Last Modified: 24 May 2023
NeurIPS 2021 Datasets and Benchmarks Track (Round 1)
Readers: Everyone
Keywords: set-to-sequence, structure prediction, product catalog, product catalogue, product brochure, permutation learning
TL;DR: A set-to-sequence dataset for complex structure prediction based on 1.5M items in real product catalogues, with metrics, benchmarks, and a synthetic data generation library.
Abstract: In this dataset paper we introduce PROCAT, a novel e-commerce dataset containing expertly designed product catalogues consisting of individual product offers grouped into complementary sections. We aim to address the scarcity of existing datasets in the area of set-to-sequence machine learning tasks, which involve complex structure prediction. The task's difficulty is further compounded by the need to place rare and previously unseen instances into sequences, as well as by variable sequence lengths and substructures in the form of diversely structured catalogues. PROCAT provides catalogue data consisting of over 1.5 million set items across a 4-year period, both in raw text form and as pre-processed features containing information about relative visual placement. In addition to this ready-to-use dataset, we include baseline experimental results on a proposed benchmark task from a number of joint set encoding and permutation learning model architectures.
URL: and
Supplementary Material: zip
Contribution Process Agreement: Yes
Dataset Url:
License: The data is made publicly available under the Attribution-NonCommercial-ShareAlike 4.0 International license (CC BY-NC-SA 4.0). The dataset should not be used for commercial purposes.
Author Statement: Yes