Deeply Fusing Semantics and Interactions for Item Representation Learning via Topology-driven Pre-training
Abstract: Learning item representation is crucial for a myriad of on-line e-commerce applications.
The nucleus of retail item representation learning is how to properly fuse the semantics within a single item, and the interactions across different items generated by user behaviors (e.g., co-click or co-view).
Product semantics depict the intrinsic characteristics of the item, while the interactions describe the relationships between items from the perspective of human perception.
Existing approaches either rely solely on a single type of information or only loosely couple the two, which limits the quality of the learned representations.
In this work, we propose a novel model named TESPA in which semantic modeling and interaction modeling reinforce each other.
Specifically, collaborative filtering signals in the interaction graph are encoded into the language models through fine-grained topological pre-training, and the interaction graph is further enriched based on semantic similarities.
After that, a novel multi-channel co-training paradigm is proposed to deeply fuse the semantics and interactions under a unified framework.
In short, TESPA draws on the strengths of both semantic modeling and interaction modeling to improve item representation learning.
Experimental results of on-line and off-line evaluations demonstrate the superiority of our proposal.
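
The graph enrichment step mentioned in the abstract (adding edges to the interaction graph based on semantic similarities) can be illustrated with a minimal sketch. The abstract does not specify the similarity measure, the selection rule, or where the item embeddings come from, so the cosine similarity, the `threshold` parameter, the function names, and the toy embeddings below are all assumptions made for illustration, not the authors' implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def enrich_graph(edges, embeddings, threshold=0.9):
    """Return the interaction edges plus edges between semantically similar items.

    edges:      iterable of (item, item) pairs from user behavior (e.g., co-click).
    embeddings: dict mapping item id -> semantic vector (hypothetical input).
    threshold:  assumed cutoff; pairs at or above it receive a new edge.
    """
    # Undirected edges are stored as frozensets so (a, b) and (b, a) coincide.
    enriched = {frozenset(edge) for edge in edges}
    items = sorted(embeddings)
    for i, a in enumerate(items):
        for b in items[i + 1:]:
            if cosine(embeddings[a], embeddings[b]) >= threshold:
                enriched.add(frozenset((a, b)))
    return enriched


# Toy data: one observed interaction and three items with 2-d semantic vectors.
interaction_edges = [("shirt_a", "jeans_c")]
item_embeddings = {
    "shirt_a": [1.0, 0.0],
    "shirt_b": [0.9, 0.1],
    "jeans_c": [0.0, 1.0],
}
enriched_edges = enrich_graph(interaction_edges, item_embeddings, threshold=0.9)
```

In this toy example the two shirts never co-occur in user behavior, yet their vectors are close, so the enriched graph links them while keeping the original interaction edge. The all-pairs loop is quadratic in the number of items; a catalog of realistic size would presumably need an approximate nearest-neighbor search instead.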
Primary Subject Area: [Engagement] Multimedia Search and Recommendation
Relevance To Conference: This work introduces TESPA, a novel framework designed to advance multimedia and multimodal processing in online e-commerce by jointly leveraging item semantics (i.e., title, description, price, and category) and user behavior interactions (e.g., co-click or co-view). Unlike traditional methods that focus on unimodal data or only loosely integrate multimodal inputs, TESPA deeply fuses the semantic information in an item's multiple attributes with user behaviors. This approach lets textual semantics and contextual interactions enhance each other, which is essential for multimedia and multimodal processing.
Supplementary Material: zip
Submission Number: 5214