Abstract: The evolution of large vision-language models (LVLMs) has advanced many fields, particularly multimodal recommendation. While LVLMs offer an integrated understanding of the textual and visual information of items in user interactions, their deployment in this domain remains limited due to inherent complexities. First, LVLMs are trained on enormous general-purpose datasets and lack knowledge of personalized user preferences. Second, LVLMs struggle to process multiple images, especially the discrete, noisy, and redundant images common in recommendation scenarios. To address these issues, we introduce a new reasoning strategy for multimodal recommendation called Visual-Summary Thought (VST). This approach first prompts LVLMs to generate textual summaries of item images, which serve as contextual information. These summaries are then combined with item titles to enhance the representation of sequential interactions and improve the ranking of candidates. Our experiments, conducted across four datasets using three different LVLMs (GPT-4V, LLaVA-7B, and LLaVA-13B), validate the effectiveness of VST.
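The two-stage flow described in the abstract (summarize item images, then rank candidates using titles plus summaries) can be illustrated with a minimal sketch. Everything below is an assumption for illustration: the `call_lvlm` wrapper, the prompt wording, and the data layout are hypothetical placeholders, not the paper's actual prompts or implementation.

```python
# Minimal sketch of the Visual-Summary Thought (VST) flow described in the abstract.
# Assumption: `call_lvlm(prompt, images=None)` is a hypothetical wrapper around any
# LVLM backend (e.g., GPT-4V or LLaVA); prompts and data structures are illustrative only.

from typing import Dict, List


def call_lvlm(prompt: str, images: List[str] = None) -> str:
    """Hypothetical LVLM call; replace with a concrete GPT-4V or LLaVA client."""
    raise NotImplementedError


def summarize_item_images(item: Dict) -> str:
    """Stage 1: prompt the LVLM for a textual summary of an item's images."""
    prompt = (
        f"Summarize the visual characteristics of the product '{item['title']}' "
        "shown in the attached images in one short sentence."
    )
    return call_lvlm(prompt, images=item["image_paths"])


def rank_candidates(history: List[Dict], candidates: List[Dict]) -> str:
    """Stage 2: combine item titles with visual summaries and ask the LVLM to rank candidates."""
    history_text = "\n".join(
        f"- {item['title']} (visual summary: {summarize_item_images(item)})"
        for item in history
    )
    candidate_text = "\n".join(
        f"{i + 1}. {c['title']}" for i, c in enumerate(candidates)
    )
    prompt = (
        "A user has interacted with the following items:\n"
        f"{history_text}\n\n"
        "Rank the candidate items below by how likely the user is to interact "
        "with them next, most likely first:\n"
        f"{candidate_text}"
    )
    return call_lvlm(prompt)
```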