Abstract: With the surge in multimedia applications and the explosion of multimedia content, multimodal recommendation has garnered increasing attention. The core idea of multimodal recommendation is to effectively exploit items' multimodal information to enhance the performance of recommender systems. Previous work has primarily focused on integrating item ID embeddings with multimodal representations, but it suffers from two main drawbacks: first, a lack of interaction between multimodal features, which neglects the inherent semantic relationships among them; second, a distribution gap between ID embeddings and multimodal representations, which adversely affects recommendation performance. In this paper, we propose a novel Multi-stage Interactive Network for multimodal recommendation, named MIN. Specifically, we first discard ID embeddings and retain only multimodal features to represent items. We then design a three-stage interaction network to model the multimodal representations and the latent structural semantics between users and items. Additionally, to obtain more general and robust representations, we apply graph augmentation techniques in the final stage. Finally, we employ contrastive learning to address the inconsistency between user and item representations. Comprehensive experiments on three publicly available datasets validate the efficacy of our method.