MENTOR: Multi-level Self-supervised Learning for Multimodal Recommendation

Published: 01 Jan 2025 · Last Modified: 31 Jul 2025 · AAAI 2025 · CC BY-SA 4.0
Abstract: As multimedia information proliferates, multimodal recommendation systems have garnered significant attention. These systems leverage multimodal information to alleviate the data sparsity problem inherent in recommendation systems and thereby improve recommendation accuracy. Because of the natural semantic disparities among multimodal features, recent research has primarily focused on cross-modal alignment with self-supervised learning to bridge these gaps. However, aligning features across modalities can discard valuable interaction information and push them away from the ID embeddings. It is crucial to recognize that the primary goal of multimodal recommendation is to predict user preferences, not merely to understand multimodal content. To this end, we propose a new Multi-level sElf-supervised learNing for mulTimOdal Recommendation (MENTOR) method, which effectively reduces the gap among modalities while retaining interaction information. Specifically, MENTOR first extracts representations for each modality from both the heterogeneous user-item graph and the homogeneous item-item graph. It then employs a multi-level cross-modal alignment task, guided by ID embeddings, to align the modalities across multiple levels while retaining historical interaction information. To balance effectiveness and efficiency, we further propose an optional general feature enhancement task that strengthens the general features from both the structure and feature perspectives, improving the robustness of our model.
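To make the ID-guided cross-modal alignment idea concrete, below is a minimal sketch, assuming an InfoNCE-style contrastive objective in PyTorch that pulls each modality's item representations toward the interaction-derived ID embeddings (and toward each other). The function names, the temperature value, and the use of in-batch negatives are illustrative assumptions for exposition, not the authors' implementation.

```python
import torch
import torch.nn.functional as F


def infonce(anchor: torch.Tensor, positive: torch.Tensor, temperature: float = 0.2) -> torch.Tensor:
    """InfoNCE with in-batch negatives: row i of `anchor` should match row i of `positive`."""
    a = F.normalize(anchor, dim=-1)
    p = F.normalize(positive, dim=-1)
    logits = a @ p.t() / temperature                     # (B, B) similarity matrix
    labels = torch.arange(a.size(0), device=a.device)    # diagonal entries are positives
    return F.cross_entropy(logits, labels)


def alignment_loss(id_emb: torch.Tensor,
                   visual_emb: torch.Tensor,
                   textual_emb: torch.Tensor,
                   temperature: float = 0.2) -> torch.Tensor:
    """Align each modality to the ID embedding, and the modalities to each other,
    so modal features stay close to the historical-interaction signal."""
    loss = infonce(visual_emb, id_emb, temperature)
    loss = loss + infonce(textual_emb, id_emb, temperature)
    loss = loss + infonce(visual_emb, textual_emb, temperature)
    return loss


if __name__ == "__main__":
    # Toy usage: a batch of 128 items with 64-dimensional representations.
    B, D = 128, 64
    id_emb, vis_emb, txt_emb = torch.randn(B, D), torch.randn(B, D), torch.randn(B, D)
    print(alignment_loss(id_emb, vis_emb, txt_emb).item())
```

In this sketch the ID embeddings act as the shared anchor, which is one simple way to reduce the gap among modalities without letting the aligned features drift from the interaction signal; the paper's multi-level variant applies such alignment at multiple representation levels.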