Abstract: The evolution of large vision-language models (LVLMs) has advanced many fields, particularly multimodal recommendation. While LVLMs offer an integrated understanding of the textual and visual information of items in user interactions, their deployment in this domain remains limited due to inherent complexities. First, LVLMs are trained on enormous general-purpose datasets and lack knowledge of personalized user preferences. Second, LVLMs struggle to process multiple images, especially the discrete, noisy, and redundant images common in recommendation scenarios. To address these issues, we introduce a new reasoning strategy for multimodal recommendation called Visual-Summary Thought (VST). This approach first prompts LVLMs to generate textual summaries of item images, which serve as contextual information. These summaries are then combined with item titles to enhance the representation of sequential interactions and improve the ranking of candidates. Our experiments, conducted across four datasets using three different LVLMs (GPT-4V, LLaVA-7B, and LLaVA-13B), validate the effectiveness of VST.
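The two-stage flow described in the abstract (summarize item images, then rank candidates using titles plus summaries) can be illustrated with a minimal sketch. Everything below is an assumption for illustration: the `call_lvlm` wrapper, the prompt wording, and the data layout are hypothetical placeholders, not the paper's actual prompts or implementation.

```python
# Minimal sketch of the Visual-Summary Thought (VST) flow described in the abstract.
# Assumption: `call_lvlm(prompt, images=None)` is a hypothetical wrapper around any
# LVLM backend (e.g., GPT-4V or LLaVA); prompts and data structures are illustrative only.

from typing import Dict, List


def call_lvlm(prompt: str, images: List[str] = None) -> str:
    """Hypothetical LVLM call; replace with a concrete GPT-4V or LLaVA client."""
    raise NotImplementedError


def summarize_item_images(item: Dict) -> str:
    """Stage 1: prompt the LVLM for a textual summary of an item's images."""
    prompt = (
        f"Summarize the visual characteristics of the product '{item['title']}' "
        "shown in the attached images in one short sentence."
    )
    return call_lvlm(prompt, images=item["image_paths"])


def rank_candidates(history: List[Dict], candidates: List[Dict]) -> str:
    """Stage 2: combine item titles with visual summaries and ask the LVLM to rank candidates."""
    history_text = "\n".join(
        f"- {item['title']} (visual summary: {summarize_item_images(item)})"
        for item in history
    )
    candidate_text = "\n".join(
        f"{i + 1}. {c['title']}" for i, c in enumerate(candidates)
    )
    prompt = (
        "A user has interacted with the following items:\n"
        f"{history_text}\n\n"
        "Rank the candidate items below by how likely the user is to interact "
        "with them next, most likely first:\n"
        f"{candidate_text}"
    )
    return call_lvlm(prompt)
```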