FORGE: Forming Semantic Identifiers for Generative Retrieval in Industrial Datasets

02 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: semantic identifiers, industry dataset, generative retrieval
TL;DR: We propose FORGE, a comprehensive benchmark for forming semantic identifiers in generative retrieval with industrial datasets.
Abstract: Semantic identifiers (SIDs) have attracted growing attention in generative retrieval (GR) owing to their semantic discriminability. However, current research on SIDs faces three main challenges: (1) the absence of large-scale public datasets with multimodal features, (2) limited investigation into strategies for generating better SIDs, whose assessment typically relies on costly GR training, and (3) slow online convergence in industrial deployment. To address these challenges, we propose **FORGE**, a comprehensive benchmark for **FO**rming semantic identifie**R**s for **G**enerative r**E**trieval in industrial datasets. Specifically, FORGE provides a dataset comprising **14 billion** user interactions and multimodal features of **250 million** items sampled from one of the largest e-commerce platforms in China, which serves over 300 million users each day. Leveraging this dataset, FORGE examines the impact of SID construction on recommendation from multiple perspectives and validates these effects via offline experiments across different settings and tasks. Further online studies conducted on our platform for homepage recommendation show a 0.35% increase in transaction count, highlighting FORGE's practical impact. Because validating SIDs normally requires expensive full training of GR models, we propose two novel SID metrics that correlate positively with recommendation performance, enabling convenient evaluation without any GR training. For real-world applications, FORGE introduces an offline pretraining schema that halves the online convergence time. The code and data are available at https://anonymous.4open.science/r/forge.
Primary Area: datasets and benchmarks
Submission Number: 869