TL;DR: We propose MTSTRec, a transformer-based model that temporally aligns multimodal information in sequential recommendation, achieving state-of-the-art performance without relying on private user data.
Abstract: Sequential recommendation in e-commerce utilizes users' anonymous browsing histories to personalize product suggestions without relying on private information. Existing item ID-based methods and multimodal models often overlook the temporal alignment of modalities such as textual descriptions, visual content, and prices in user browsing sequences. To address this limitation, this paper proposes the Multimodal Time-aligned Shared Token Recommender (MTSTRec), a transformer-based framework with a single time-aligned shared token per product for efficient cross-modality fusion. MTSTRec preserves the distinct contributions of each modality while aligning them temporally to better capture user preferences. Extensive experiments demonstrate that MTSTRec achieves state-of-the-art performance across multiple sequential recommendation benchmarks, significantly improving upon existing multimodal fusion strategies. Our code is available at https://github.com/idssplab/MTSTRec.
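To make the core idea concrete, the sketch below illustrates one plausible reading of time-aligned shared-token fusion: for each product in the browsing sequence, a learned shared token attends over that product's per-modality embeddings (e.g., ID, text, image, price), and the resulting fused sequence feeds a causal transformer for next-item scoring. This is a minimal, hypothetical PyTorch sketch; all module and parameter names are illustrative and not taken from the MTSTRec paper or codebase, whose exact architecture may differ (see the linked repository).

```python
# Minimal, hypothetical sketch of time-aligned shared-token fusion for
# sequential recommendation. Names and dimensions are illustrative only.
import torch
import torch.nn as nn


class SharedTokenFusion(nn.Module):
    """Fuse the per-modality embeddings of one product via a single shared token."""

    def __init__(self, d_model: int, n_heads: int = 4):
        super().__init__()
        # One learned token that queries all modality embeddings of a product.
        self.shared_token = nn.Parameter(torch.randn(1, 1, d_model) * 0.02)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, modality_embs: torch.Tensor) -> torch.Tensor:
        # modality_embs: (batch * seq_len, n_modalities, d_model)
        # Each row holds the embeddings of ONE product, so fusion is tied to
        # that product's position in time (the "time-aligned" part).
        query = self.shared_token.expand(modality_embs.size(0), -1, -1)
        fused, _ = self.attn(query, modality_embs, modality_embs)
        return fused.squeeze(1)  # (batch * seq_len, d_model)


class TimeAlignedRecommender(nn.Module):
    """Fused per-product tokens feed a causal transformer for next-item ranking."""

    def __init__(self, n_items: int, d_model: int = 64, max_len: int = 50):
        super().__init__()
        self.fusion = SharedTokenFusion(d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.out = nn.Linear(d_model, n_items)

    def forward(self, modality_embs: torch.Tensor) -> torch.Tensor:
        # modality_embs: (batch, seq_len, n_modalities, d_model), already
        # projected to a common dimension by per-modality encoders (not shown).
        b, t, m, d = modality_embs.shape
        fused = self.fusion(modality_embs.reshape(b * t, m, d)).reshape(b, t, d)
        fused = fused + self.pos_emb(torch.arange(t, device=fused.device))
        # Causal mask so each position only sees earlier products.
        causal = torch.triu(
            torch.full((t, t), float("-inf"), device=fused.device), diagonal=1
        )
        hidden = self.encoder(fused, mask=causal)
        return self.out(hidden)  # next-item scores at every position
```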
Lay Summary: Online shopping platforms often recommend products to users based on their browsing history, but traditional methods usually rely only on product IDs and ignore important information from other modalities such as images, text descriptions, and prices. Even when multimodal data is included, existing models often fail to align these signals with the timing of the user's browsing sequence, which leads to less accurate recommendations.
To solve this, we developed MTSTRec, a new recommendation model that aligns these different types of information, such as images, text, and prices, over time. This allows the system to better understand user behavior and make more accurate suggestions.
Our experiments show that MTSTRec consistently outperforms other recommendation models, leading to more relevant product suggestions for users. This improvement could make online shopping experiences faster and more personalized, benefiting both customers and retailers. In the future, we plan to adapt MTSTRec for broader applications, unlocking its potential to enhance user experiences and decision-making in multiple domains.
Application-Driven Machine Learning: This submission is on Application-Driven Machine Learning.
Link To Code: https://github.com/idssplab/MTSTRec
Primary Area: Deep Learning->Sequential Models, Time series
Keywords: Multimodal Sequential Recommendation, Time-aligned Shared Token, Image Style Representation, Large Language Model
Submission Number: 5365