Abstract: With the rapid development of online multimedia services, especially on e-commerce platforms, there is a pressing need for personalised recommender systems that can effectively encode the diverse multi-modal content associated with each item. However, we argue that existing top-k multi-modal recommender systems typically use isolated processes for both feature extraction and modality encoding. Such isolated processes can harm recommendation performance. First, an isolated extraction process underestimates the importance of effective feature extraction in multi-modal recommendation, potentially incorporating non-relevant information, which is harmful to item representations. Second, an isolated modality encoding process produces disjoint embeddings for item modalities due to the individual processing of each modality, which leads to a suboptimal fusion of user/item representations for effective user preference prediction. We hypothesise that using a unified model to address both of these isolated processes will enable the consistent extraction and cohesive fusion of joint multi-modal features, thereby enhancing the effectiveness of multi-modal recommender systems. In this paper, we propose a novel model, called Unified multi-modal Graph Transformer (UGT), which first leverages a multi-way transformer to extract aligned multi-modal features from raw data for top-k recommendation. Subsequently, we build a unified graph neural network in our UGT model to jointly fuse the multi-modal user/item representations derived from the output of the multi-way transformer. Using this graph transformer architecture, we show that UGT achieves significant effectiveness gains, especially when jointly optimised with the commonly used recommendation losses. Our extensive experiments on three benchmark datasets show that our proposed UGT model consistently outperforms 13 strong recommendation approaches, ranging from established to state-of-the-art, achieving up to a 13.97% improvement over the best baseline. In addition, we demonstrate that UGT effectively enhances modality fusion by significantly improving the contribution of each modality in the multi-modal recommendation task. We also show that UGT leverages its pre-trained multi-modal knowledge as auxiliary information to enhance the recommendation performance for cold-start users. Furthermore, we present a case study illustrating how our UGT model qualitatively recommends more useful and semantically relevant items to users compared to the best-performing baseline, namely FREEDOM. Finally, we demonstrate that UGT exhibits strong out-of-domain performance in a micro-video recommendation scenario.
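To make the described pipeline more concrete, the following is a minimal, hypothetical PyTorch sketch of the three components named in the abstract: a multi-way transformer block (shared self-attention with per-modality feed-forward experts), a unified graph neural network that jointly fuses user and multi-modal item representations, and a commonly used recommendation loss. All class names, layer sizes, the LightGCN-style propagation, and the choice of BPR as the loss are illustrative assumptions and are not taken from the paper or its released code.

```python
# Hypothetical sketch of the UGT-style pipeline (not the authors' implementation).
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiWayTransformerBlock(nn.Module):
    """Shared self-attention over all modality tokens; modality-specific feed-forward experts."""

    def __init__(self, dim: int, n_heads: int, n_modalities: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        # One feed-forward expert per modality (e.g., image, text).
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(n_modalities)
        )

    def forward(self, tokens: torch.Tensor, modality_ids: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, seq_len, dim); modality_ids: (seq_len,) modality index of each token.
        h = tokens + self.attn(self.norm1(tokens), self.norm1(tokens),
                               self.norm1(tokens), need_weights=False)[0]
        out = h.clone()
        for m, expert in enumerate(self.experts):
            mask = modality_ids == m  # route each token through its modality's expert
            out[:, mask] = h[:, mask] + expert(self.norm2(h[:, mask]))
        return out


class UnifiedGraphFusion(nn.Module):
    """LightGCN-style propagation that jointly fuses user and multi-modal item embeddings."""

    def __init__(self, n_users: int, dim: int, n_layers: int = 2):
        super().__init__()
        self.user_emb = nn.Embedding(n_users, dim)
        self.n_layers = n_layers

    def forward(self, item_emb: torch.Tensor, norm_adj: torch.Tensor):
        # item_emb: per-item embeddings pooled from the multi-way transformer output.
        # norm_adj: symmetrically normalised (n_users + n_items) x (n_users + n_items) adjacency.
        x = torch.cat([self.user_emb.weight, item_emb], dim=0)
        layers = [x]
        for _ in range(self.n_layers):
            x = norm_adj @ x
            layers.append(x)
        x = torch.stack(layers).mean(dim=0)
        n_users = self.user_emb.num_embeddings
        return x[:n_users], x[n_users:]


def bpr_loss(user: torch.Tensor, pos_item: torch.Tensor, neg_item: torch.Tensor) -> torch.Tensor:
    """Bayesian Personalised Ranking loss over (user, positive item, negative item) triples."""
    return -F.logsigmoid((user * pos_item).sum(-1) - (user * neg_item).sum(-1)).mean()
```

In a joint training step of this kind, the recommendation loss would be backpropagated through both the graph fusion layer and the multi-way transformer, which is one plausible reading of the abstract's claim that the unified model is jointly optimised with commonly used recommendation losses.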
DOI: 10.1145/3760762