Keywords: Large Language Models, Multimodal Models, Personal LLM Agents
Abstract: With the increasing use of smartphones, users often take photos to quickly capture and store information. This multimodal personalized data offers a promising research direction for developing smartphone AI assistants. In this paper, we introduce a new task in this context called multimodal personalized retrieval (MPR). The MPR task takes a user's text query as input and retrieves images that match the user's search intent. The task presents three key challenges: 1) effective management of personal data, 2) handling low-quality user queries, and 3) the need for a lightweight model architecture that can operate on personal devices. To address these challenges, we propose GAMER, which enhances multimodal retrieval by leveraging LLM-driven query refinement and reinforcement learning from human feedback (RLHF) to optimize end-to-end performance. Extensive experiments demonstrate a 13.2% improvement over state-of-the-art (SOTA) baselines. Moreover, GAMER has been deployed in real products, resulting in an improved user experience.
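The MPR pipeline outlined in the abstract (refine the user's query, then retrieve the matching images) can be sketched with a toy example. Everything here is an illustrative stand-in, not the paper's GAMER system: the "encoder" is a bag-of-words counter, and the "LLM-driven refinement" is a hypothetical dictionary expansion of shorthand terms.

```python
# Toy sketch of a multimodal personalized retrieval (MPR) pipeline.
# All names and components are illustrative assumptions, not GAMER itself.
from collections import Counter
from math import sqrt


def embed(text: str) -> Counter:
    """Hypothetical stand-in for a multimodal encoder: bag-of-words counts."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def refine_query(query: str) -> str:
    """Stand-in for LLM-driven query refinement: expand shorthand terms."""
    expansions = {"pic": "photo", "rcpt": "receipt"}
    return " ".join(expansions.get(w, w) for w in query.lower().split())


def retrieve(query: str, gallery: dict, k: int = 1) -> list:
    """Rank gallery images (represented here by captions) against the query."""
    q = embed(refine_query(query))
    ranked = sorted(gallery, key=lambda img: cosine(q, embed(gallery[img])),
                    reverse=True)
    return ranked[:k]


gallery = {
    "IMG_001.jpg": "photo of a restaurant receipt",
    "IMG_002.jpg": "photo of a mountain hike",
}
print(retrieve("rcpt pic", gallery))  # ['IMG_001.jpg']
```

In the real task the gallery entries would be image embeddings rather than captions, and the refinement step would be an on-device LLM tuned with RLHF; the sketch only shows how refinement can recover a noisy query's intent before retrieval.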
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 15497