USER-LLM: User History Encoding and Compression for Multimodal Recommendation with Large Language Models

ACL ARR 2024 December Submission1546 Authors

16 Dec 2024 (modified: 05 Feb 2025) · ACL ARR 2024 December Submission · CC BY 4.0
Abstract: While large language models (LLMs) have proven effective at leveraging textual data for recommendation, their application to multimodal tasks involving visual content remains underexplored. Although an LLM can comprehend multimodal content through a projection function that maps visual features into its semantic space, recommendation tasks often require representing a user's interaction history through lengthy prompts that combine text and visual elements. Such prompts not only hamper training and inference efficiency but also make it difficult for the model to accurately capture user preferences from complex and extended inputs, reducing recommendation performance. To address this challenge, we introduce USER-LLM, a multimodal recommendation framework that integrates textual and visual features through a User History Encoding Module (UHEM), which compresses a user's multimodal interaction history into a single token representation and thereby facilitates the LLM's processing of user preferences. Extensive experiments demonstrate the effectiveness and efficiency of the proposed mechanism.
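The abstract describes compressing a user's multimodal interaction history into a single token that the LLM consumes in place of a long text-and-image prompt. The sketch below illustrates one way such a module could look; it is an assumption-laden illustration, not the authors' implementation. The class name `UserHistoryEncoder`, the attention-pooling design, and all dimensions (e.g., `llm_dim=4096`) are hypothetical choices made for the example.

```python
# Minimal sketch of the user-history compression idea: per-item text and
# image features are fused, the history sequence is pooled with a learned
# query, and the result is projected into the LLM's embedding space as a
# single "user token". All names and sizes are illustrative assumptions.
import torch
import torch.nn as nn


class UserHistoryEncoder(nn.Module):
    """Compress a user's multimodal interaction history into one LLM token."""

    def __init__(self, text_dim=768, image_dim=512, hidden_dim=512,
                 llm_dim=4096, n_heads=8):
        super().__init__()
        # Map each modality into a shared hidden space before fusion.
        self.text_proj = nn.Linear(text_dim, hidden_dim)
        self.image_proj = nn.Linear(image_dim, hidden_dim)
        # Learned query that attends over the fused history sequence.
        self.query = nn.Parameter(torch.randn(1, 1, hidden_dim))
        self.pool = nn.MultiheadAttention(hidden_dim, n_heads, batch_first=True)
        # Final projection into the LLM's input-embedding space.
        self.to_llm = nn.Linear(hidden_dim, llm_dim)

    def forward(self, text_feats, image_feats):
        # text_feats:  (batch, history_len, text_dim)
        # image_feats: (batch, history_len, image_dim)
        items = self.text_proj(text_feats) + self.image_proj(image_feats)
        query = self.query.expand(items.size(0), -1, -1)
        pooled, _ = self.pool(query, items, items)  # (batch, 1, hidden_dim)
        return self.to_llm(pooled)                  # (batch, 1, llm_dim)


if __name__ == "__main__":
    encoder = UserHistoryEncoder()
    text = torch.randn(2, 20, 768)    # 20 past items, text features per item
    image = torch.randn(2, 20, 512)   # matching image features per item
    user_token = encoder(text, image)
    print(user_token.shape)           # torch.Size([2, 1, 4096])
    # The single user token would be prepended to the LLM's prompt embeddings,
    # replacing the lengthy text+image history in the prompt.
```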
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: cross-modal information extraction; multimodality
Contribution Types: Model analysis & interpretability
Languages Studied: English
Submission Number: 1546