Keywords: Sequential recommendation, tokenization, representation learning, vector quantization
TL;DR: This paper presents a unified framework for recommendation that combines semantic and ID tokenization with a hybrid of cosine similarity and Euclidean distance to improve prediction accuracy and reduce token redundancy.
Abstract: Effective recommendation is crucial for large-scale online advertising platforms, where understanding user-item interactions with limited exposure is a persistent challenge. Traditional systems rely heavily on ID tokens to uniquely represent items, capturing distinct associations but suffering from redundancy and poor generalization in cold-start settings. Semantic tokens, by contrast, encode transferable item attributes but often lead to duplication and inconsistent performance gains. To address these limitations, we propose a Causally-Informed Unified Semantic and ID Representation Learning framework that harnesses the complementary strengths of both token types. Our approach treats ID tokens as anchors for item-specific information while using semantic tokens to encode shared, generalizable features. To further enhance representation quality, we introduce a hybrid similarity mechanism: cosine similarity is applied in early layers to decouple over-smoothed embeddings, while Euclidean distance is used in the final layer to sharpen item discrimination. We also integrate causal learning principles to disentangle user exposure bias and improve robustness in recommendation scenarios, especially under data sparsity. Experiments on three benchmark datasets show that our method outperforms state-of-the-art baselines by 6%–17% and reduces the token vocabulary size by over 80%. These results demonstrate the value of combining semantic and ID tokenization with causal learning to build more generalizable and effective recommendation systems. Code is available at: https://anonymous.4open.science/r/Unified_Semantic_ID-9E94.
Submission Number: 7
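
The hybrid similarity mechanism described in the abstract can be illustrated with a minimal sketch. It assumes that the "layers" are the levels of a residual vector quantizer, where each level picks one code and passes the remainder to the next level; the abstract does not confirm this, so treat it as an assumption. The function names `nearest_code` and `hybrid_quantize` and the toy codebooks are hypothetical and are not taken from the released code.

```python
import numpy as np

def nearest_code(residual, codebook, metric):
    """Return the index of the codebook row that best matches `residual`."""
    if metric == "cosine":
        # Cosine similarity compares directions only and ignores vector norms.
        r = residual / (np.linalg.norm(residual) + 1e-12)
        c = codebook / (np.linalg.norm(codebook, axis=1, keepdims=True) + 1e-12)
        return int(np.argmax(c @ r))
    # Euclidean distance is sensitive to both direction and magnitude.
    return int(np.argmin(np.linalg.norm(codebook - residual, axis=1)))

def hybrid_quantize(x, codebooks):
    """Assumed scheme: cosine matching on every level except the last,
    Euclidean matching on the last level."""
    residual = np.asarray(x, dtype=float).copy()
    codes = []
    last = len(codebooks) - 1
    for level, codebook in enumerate(codebooks):
        metric = "euclidean" if level == last else "cosine"
        idx = nearest_code(residual, codebook, metric)
        codes.append(idx)
        # Pass what this level did not explain on to the next level.
        residual = residual - codebook[idx]
    return codes

# Toy codebooks (hypothetical values chosen so the two metrics disagree).
codebooks = [
    np.array([[2.0, 0.5], [0.0, 10.0]]),     # early level: matched by cosine
    np.array([[10.0, -70.0], [1.0, -7.0]]),  # final level: matched by Euclidean
]
codes = hybrid_quantize([1.0, 3.0], codebooks)
```

In this toy setup the first level selects the code whose direction matches the input even though another code is closer in Euclidean terms, while the final level separates two codes that point in the same direction but differ in magnitude. That is one way to read the abstract's distinction between the two metrics, not a description of the authors' implementation.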