Modality-Disentangled Feature Extraction via Knowledge Distillation in Multimodal Recommendation Systems

Published: 2025 · Last Modified: 22 Jan 2026 · IEEE Trans. Comput. Soc. Syst. 2025 · CC BY-SA 4.0
Abstract: Multimodal recommendation enhances item representation in recommendation systems by integrating modalities of item information beyond traditional ID-based features, using supplementary details such as images, text, video, and audio to refine item representations and thereby improve recommendation precision. It has become an active research direction because it helps mitigate data sparsity and better represents long-tail content, improving overall recommendation quality. However, progress is currently hindered by two main obstacles. First, extracting multimodal features from pre-trained models with either shallow or deep neural networks often leads to insufficient information extraction in shallow networks or overfitting on sparse recommendation data in deep networks, resulting in suboptimal performance. Second, much prior work focuses on fusing information across modalities while overlooking the distinct characteristics inherent in each modality. To address these challenges, we introduce a method titled “modality-disentangled feature extraction via knowledge distillation in multimodal recommendation systems” (MODEST). First, to resolve the trade-off between deep and shallow feature extractors, our approach adopts a teacher–student framework: deep neural networks extract representation vectors from text and image data, the features are fused via attention mechanisms, and semantic labels serve as classification targets for three supervised loss functions, substantially strengthening the teacher network's capacity to extract multimodal features. The teacher's knowledge is then transferred to a student network through knowledge distillation; the student uses a shallow neural network, and only the student is used at the inference stage. This strategy resolves both the data-sparsity problem of deep networks and the limited extraction capacity of shallow networks. Second, to better capture the similarities and distinct features among modalities, we introduce a disentangled modality decomposition: learned mappings separately take the text and image information from the teacher–student networks and decompose it into cross-modality common information and cross-modality specific information. Contrastive learning constraints minimize the distance between the common components and maximize the separation between the specific components, with an auxiliary loss promoting convergence, which addresses the problem of cross-modality feature alignment. Finally, we combine the recommendation loss with the added loss constraints to form a unified optimization objective. Extensive experiments and visualizations on several real-world datasets show that our model achieves significant performance gains and is competitive with existing methods.
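
The abstract's teacher–student design can be illustrated with a minimal PyTorch sketch: a deep teacher encodes text and image features, fuses them with attention, and classifies against semantic labels, while a shallow student is trained to mimic the teacher's fused representation. The layer sizes, module names, and the softened-KL distillation objective are illustrative assumptions; the abstract does not specify the exact architecture or losses.

```python
# Minimal sketch of the teacher-student distillation idea (assumed details, not the paper's exact model).
import torch
import torch.nn as nn
import torch.nn.functional as F


class TeacherNet(nn.Module):
    """Deep network: encodes text/image features and fuses them with attention."""
    def __init__(self, txt_dim, img_dim, hidden, n_classes):
        super().__init__()
        self.txt_enc = nn.Sequential(nn.Linear(txt_dim, hidden), nn.ReLU(),
                                     nn.Linear(hidden, hidden), nn.ReLU())
        self.img_enc = nn.Sequential(nn.Linear(img_dim, hidden), nn.ReLU(),
                                     nn.Linear(hidden, hidden), nn.ReLU())
        self.attn = nn.Linear(hidden, 1)          # scores each modality for fusion
        self.cls = nn.Linear(hidden, n_classes)   # semantic labels used as classification targets

    def forward(self, txt, img):
        h = torch.stack([self.txt_enc(txt), self.img_enc(img)], dim=1)  # (B, 2, hidden)
        w = torch.softmax(self.attn(h), dim=1)                          # attention weights over modalities
        fused = (w * h).sum(dim=1)                                      # attention-based fusion
        return fused, self.cls(fused)


class StudentNet(nn.Module):
    """Shallow network used at inference time after distillation."""
    def __init__(self, txt_dim, img_dim, hidden):
        super().__init__()
        self.proj = nn.Linear(txt_dim + img_dim, hidden)

    def forward(self, txt, img):
        return self.proj(torch.cat([txt, img], dim=-1))


def distill_loss(student_feat, teacher_feat, tau=2.0):
    """One common KD choice: soften both representations and match them with KL divergence."""
    p_teacher = F.log_softmax(teacher_feat / tau, dim=-1).exp()
    log_p_student = F.log_softmax(student_feat / tau, dim=-1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * tau * tau
```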
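
Similarly, the disentangled modality decomposition can be sketched as two learned mappings per modality, one producing cross-modality common information and one producing cross-modality specific information, with an InfoNCE-style contrastive term that pulls the common parts of text and image together and an auxiliary term that pushes the specific parts apart. The projection sizes, temperature, and exact contrastive formulation are assumptions for illustration only.

```python
# Minimal sketch of the common/specific decomposition with contrastive constraints (assumed details).
import torch
import torch.nn as nn
import torch.nn.functional as F


class Disentangler(nn.Module):
    def __init__(self, dim, hidden):
        super().__init__()
        self.common = nn.Linear(dim, hidden)    # cross-modality common information
        self.specific = nn.Linear(dim, hidden)  # cross-modality specific information

    def forward(self, x):
        return self.common(x), self.specific(x)


def disentangle_loss(txt_feat, img_feat, txt_dis, img_dis, tau=0.1):
    """Align common parts across modalities; keep modality-specific parts apart (auxiliary term)."""
    t_common, t_specific = txt_dis(txt_feat)
    i_common, i_specific = img_dis(img_feat)
    t_common, i_common = F.normalize(t_common, dim=-1), F.normalize(i_common, dim=-1)
    t_specific, i_specific = F.normalize(t_specific, dim=-1), F.normalize(i_specific, dim=-1)

    # InfoNCE over the batch: the same item's other-modality common vector is the positive.
    logits = t_common @ i_common.t() / tau
    labels = torch.arange(logits.size(0), device=logits.device)
    align = F.cross_entropy(logits, labels)

    # Penalize similarity between the two modalities' specific parts of the same item.
    separate = F.cosine_similarity(t_specific, i_specific, dim=-1).mean()
    return align + separate
```

In training, this term would be added to the recommendation loss and the distillation and classification losses to form the unified objective described in the abstract.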