Abstract: Multimodal recommendation aims to predict users’ future behaviors based on historical interaction data and item multimodal information. Previous studies face two main issues. First, due to the complexity and high dimensionality of multimodal features, existing methods fail to extract them effectively, and their rich semantic information is often overlooked. Second, false-positive and false-negative noise in user behavior data interferes with preference modeling, especially in Graph Convolutional Network (GCN)-based models, where such noise propagates through the graph and degrades node representations. To address these challenges, we propose a knowledge distillation-based collaborative graph diffusion multimodal recommendation model (DiffKD). Specifically, DiffKD adopts a teacher-student framework in which the teacher extracts semantically rich multimodal features and transfers knowledge to the student via a transfer loss. To reduce noise while retaining key interactions, a diffusion model adds noise to user-item interactions and then removes it to generate a denoised user-item graph, which is fused with the original graph to form a collaborative graph. Finally, a multimodal feature encoder enhances user and item representations by combining high-order collaborative signals from the user-item graph with semantic relationships derived from the item-item graph. Extensive experiments on four public datasets show that DiffKD outperforms the strongest baseline by an average of 6.78%.
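The abstract does not give the model's equations, so the following is an illustration only: a minimal sketch of the two generic mechanisms it names, DDPM-style forward noising of interaction signals and an MSE-style knowledge-transfer loss between teacher and student embeddings. All function names, shapes, and the noise schedule here are hypothetical, not DiffKD's actual formulation.

```python
import numpy as np

rng = np.random.default_rng(0)

def forward_diffuse(x, t, betas):
    """Forward diffusion step (hypothetical): corrupt signal x with
    Gaussian noise at step t under a cumulative-product schedule."""
    alpha_bar = np.prod(1.0 - betas[: t + 1])
    noise = rng.standard_normal(x.shape)
    return np.sqrt(alpha_bar) * x + np.sqrt(1.0 - alpha_bar) * noise

def transfer_loss(teacher_emb, student_emb):
    """Distillation transfer loss (hypothetical): mean squared
    distance between teacher and student embeddings."""
    return float(np.mean((teacher_emb - student_emb) ** 2))

# Toy usage: a 4-user x 8-dim interaction signal and a 10-step schedule.
x = rng.standard_normal((4, 8))
betas = np.linspace(1e-4, 0.02, 10)
x_noisy = forward_diffuse(x, 5, betas)   # partially noised interactions

t_emb = rng.standard_normal((4, 8))              # teacher embeddings
s_emb = t_emb + 0.1 * rng.standard_normal((4, 8))  # imperfect student
loss = transfer_loss(t_emb, s_emb)       # small positive scalar
```

In a full model the reverse (denoising) process would be learned and the transfer loss added to the recommendation objective; this sketch only shows the shape of those two components.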
External IDs: dblp:journals/jiis/MaXL25