Keywords: Large Language Models, Multimodal Models, Personal LLM Agents
Abstract: With the increasing use of smartphones, users often take photos to quickly capture and store information. This multimodal personalized data offers a promising research direction for developing smartphone AI assistants. In this paper, we introduce a new task in this context called multimodal personalized retrieval (MPR). The MPR task takes a user's text query as input and retrieves images that match the user's search intent. The task presents three key challenges: 1) effective management of personal data, 2) handling low-quality user queries, and 3) the need for a lightweight model architecture that can operate on personal devices. To address these challenges, we propose GAMER, which enhances multimodal retrieval by leveraging LLM-driven query refinement and reinforcement learning from human feedback (RLHF) to optimize end-to-end performance. Extensive experiments demonstrate a 13.2% improvement over state-of-the-art (SOTA) baselines. Moreover, GAMER has been deployed in real products, resulting in an improved user experience.
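The MPR pipeline outlined in the abstract (refine the user's query, then retrieve the matching images) can be sketched with a toy example. Everything here is an illustrative stand-in, not the paper's GAMER system: the "encoder" is a bag-of-words counter, and the "LLM-driven refinement" is a hypothetical dictionary expansion of shorthand terms.

```python
# Toy sketch of a multimodal personalized retrieval (MPR) pipeline.
# All names and components are illustrative assumptions, not GAMER itself.
from collections import Counter
from math import sqrt


def embed(text: str) -> Counter:
    """Hypothetical stand-in for a multimodal encoder: bag-of-words counts."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def refine_query(query: str) -> str:
    """Stand-in for LLM-driven query refinement: expand shorthand terms."""
    expansions = {"pic": "photo", "rcpt": "receipt"}
    return " ".join(expansions.get(w, w) for w in query.lower().split())


def retrieve(query: str, gallery: dict, k: int = 1) -> list:
    """Rank gallery images (represented here by captions) against the query."""
    q = embed(refine_query(query))
    ranked = sorted(gallery, key=lambda img: cosine(q, embed(gallery[img])),
                    reverse=True)
    return ranked[:k]


gallery = {
    "IMG_001.jpg": "photo of a restaurant receipt",
    "IMG_002.jpg": "photo of a mountain hike",
}
print(retrieve("rcpt pic", gallery))  # ['IMG_001.jpg']
```

In the real task the gallery entries would be image embeddings rather than captions, and the refinement step would be an on-device LLM tuned with RLHF; the sketch only shows how refinement can recover a noisy query's intent before retrieval.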
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 15497