STAR: Semantic-ID Token-Embedding Alignment for Generative Recommenders

ICLR 2026 Conference Submission 15160 Authors

Published: 19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Generative Recommendation System; LLM Post-training; Semantic ID; Token Embedding
Abstract: Generative recommenders (GRs)—which directly generate the next-item Semantic ID with an autoregressive model—are rapidly gaining adoption in research and large-scale production as a scalable, efficient alternative to traditional recommendation algorithms. Yet we identify, for the first time, a fundamental failure mode in adapting language models (LMs) to GRs: a pervasive token–embedding misalignment. The common mean-of-vocabulary initialization places new Semantic-ID tokens on the LM manifold but collapses their distinctions, stripping item-level semantics and degrading data efficiency and retrieval quality. We introduce **STAR**, a lightweight alignment stage that freezes the LM and updates *only* the Semantic-ID embeddings, using paired supervision between item titles/descriptions and their Semantic-IDs, thereby endowing the new tokens with linguistically grounded, item-level semantics while preserving the pretrained model’s capabilities and the primary recommendation objective. Across multiple datasets and strong baselines, **STAR** consistently improves top-*k* retrieval/search performance over mean-of-vocabulary initialization and status-quo auxiliary-task adaptation. Ablations and analyses corroborate these claims, showing increased token-level diversity, stronger linguistic grounding, and improved sample efficiency. **STAR** is parameter-efficient, updating only the Semantic-ID token embeddings ($|\mathcal{V}_{\mathrm{SemID}}|\times D$ parameters), and integrates seamlessly with standard GR pipelines.
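To make the mechanism concrete, the following is a minimal sketch, not the authors' implementation, of the two steps the abstract describes: mean-of-vocabulary initialization of the new Semantic-ID tokens, then a STAR-style alignment stage that freezes the LM and trains only the new embedding rows on paired item-text ↔ Semantic-ID supervision. The `gpt2` backbone, the 3×256 code shape, the learning rate, and the `batch` fields are all illustrative assumptions, not details from the paper.

```python
# Minimal sketch (not the authors' code), assuming a Hugging Face
# causal LM in PyTorch. Backbone, codebook shape, and batch fields
# are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("gpt2")  # hypothetical backbone
tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Add Semantic-ID tokens, e.g. "<sid_0_17>" = level-0 codeword 17.
num_levels, codebook_size = 3, 256  # assumed Semantic-ID code shape
sid_tokens = [f"<sid_{l}_{c}>" for l in range(num_levels) for c in range(codebook_size)]
tokenizer.add_tokens(sid_tokens)
old_vocab = model.get_input_embeddings().weight.shape[0]
model.resize_token_embeddings(len(tokenizer))
emb = model.get_input_embeddings().weight  # (old_vocab + |V_SemID|, D)

# Mean-of-vocabulary initialization: the failure mode the abstract
# identifies -- every new row starts at the same point, so the tokens
# sit on the LM manifold but are mutually indistinguishable.
with torch.no_grad():
    emb[old_vocab:] = emb[:old_vocab].mean(dim=0, keepdim=True)

# STAR-style alignment stage: freeze the LM, re-enable gradients only
# on the embedding matrix, and mask updates to the pretrained rows so
# that exactly |V_SemID| x D parameters move.
for p in model.parameters():
    p.requires_grad_(False)
emb.requires_grad_(True)
optimizer = torch.optim.AdamW([emb], lr=1e-3)  # assumed learning rate

def alignment_step(batch):
    """One step on a tokenized (item title/description <-> Semantic-ID) pair batch."""
    loss = model(input_ids=batch["input_ids"], labels=batch["labels"]).loss
    loss.backward()
    emb.grad[:old_vocab] = 0  # keep pretrained token embeddings fixed
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```

One design point worth noting in this sketch: with weight-tied backbones such as GPT-2, the input embedding matrix doubles as the output head, so the gradient mask on the pretrained rows, rather than the freeze alone, is what restricts training to the $|\mathcal{V}_{\mathrm{SemID}}|\times D$ Semantic-ID parameters.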
Supplementary Material: PDF
Primary Area: generative models
Submission Number: 15160