Abstract: Personalized recommendation systems have become integral to the digital age by facilitating the discovery of content and products tailored to users' preferences. Since the Generative Artificial Intelligence (GAI) boom, research into GAI-enhanced Conversational Recommender Systems (CRSs) has attracted great interest. Most existing methods, however, rely mainly on a single input modality, such as text, limiting their ability to capture content diversity. This is also inconsistent with real-world scenarios, which involve multi-modal input and output data. To address these limitations, we propose the Multi-Modal Conversational Recommender System (MMCRec), a model that harnesses multiple modalities, including text, images, voice, and video, to enhance recommendation performance and user experience. Our model not only accepts multi-modal input but also generates multi-modal output in conversational recommendation. Experimental evaluations demonstrate the effectiveness of our model in real-world conversational recommendation scenarios.