Abstract: In recent years, the rapid growth of online multimedia services, such as e-commerce platforms, has necessitated the development of personalised recommendation approaches that can encode diverse content about each item. Indeed, modern multi-modal recommender systems exploit diverse features obtained from raw images and item descriptions to enhance the recommendation performance. However, existing multi-modal recommender systems primarily depend on features extracted individually from different media through pre-trained modality-specific encoders, and exhibit only a shallow alignment between the different modalities, thereby limiting their ability to capture the underlying relationships between the modalities. In this article, we leverage the deep alignment of large multi-modal encoders to address the shallow alignment of modalities in multi-modal recommender systems. These encoders have previously demonstrated state-of-the-art effectiveness in ranking items across various domains. Specifically, we investigate the use of three state-of-the-art large multi-modal encoders – CLIP (dual-stream), VLMo and BEiT-3 (both unified) – for recommendation tasks. We explore their benefits for recommendation using a range of strategies, including the use of pre-trained and fine-tuned encoders, as well as the end-to-end training of these encoders. We show that pre-trained large multi-modal encoders generate more aligned and effective user/item representations compared with existing modality-specific encoders across four existing multi-modal recommendation datasets. Furthermore, we show that fine-tuning these encoders further improves the recommendation performance, and that end-to-end training is the most effective paradigm, significantly outperforming both the pre-trained and fine-tuned encoders. We also demonstrate the effectiveness of large multi-modal encoders in facilitating modality alignment by evaluating the contribution of each modality separately. Finally, we show that the dual-stream approach, specifically CLIP, is the most effective architecture among these large multi-modal encoders, outperforming the unified approaches (i.e., VLMo and BEiT-3) in terms of both effectiveness and efficiency.
External IDs: dblp:journals/tors/YiLOMM25
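To make the "pre-trained encoder" setting described in the abstract concrete, the following is a minimal sketch of how a pre-trained dual-stream encoder such as CLIP could be used to obtain a joint item representation from an item's image and textual description. The checkpoint name, the encode_item helper, and the fusion by concatenation of the two projected embeddings are illustrative assumptions, not the paper's exact pipeline.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Pre-trained dual-stream multi-modal encoder (CLIP); the checkpoint
# name below is an assumption, not necessarily the one used in the paper.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
model.eval()

def encode_item(image_path: str, description: str) -> torch.Tensor:
    """Encode an item's raw image and text description into a single
    multi-modal item representation (illustrative helper)."""
    image = Image.open(image_path).convert("RGB")
    inputs = processor(text=[description], images=image,
                       return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        image_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
        text_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                           attention_mask=inputs["attention_mask"])
    # L2-normalise each modality: CLIP's contrastive pre-training places the
    # projected image and text embeddings in a shared space, so the two
    # vectors are already aligned before fusion.
    image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
    # Concatenation is one simple fusion choice; the resulting vector can be
    # fed to any downstream recommendation model, e.g. scored against a
    # learned user embedding via a dot product.
    return torch.cat([image_emb, text_emb], dim=-1).squeeze(0)

# Example usage (hypothetical item):
# item_vec = encode_item("item_123.jpg", "Red cotton summer dress")
```

In the fine-tuned and end-to-end settings discussed in the abstract, the same encoder would not be frozen: its weights would be updated either on the recommendation data beforehand or jointly with the downstream recommendation objective.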