Know Me, Respond to Me: Benchmarking LLMs for Dynamic User Profiling and Personalized Responses at Scale

Published: 24 Sept 2025 · Last Modified: 24 Sept 2025 · NeurIPS 2025 LLM Evaluation Workshop Poster · License: CC BY 4.0
Keywords: personalization, chatbot, large language models, memory, long context
TL;DR: We introduce a benchmark for evaluating the personalization abilities of large language models over long contexts.
Abstract: Large Language Models (LLMs) have emerged as personalized assistants for users across a wide range of tasks. Over time, the interaction history between a user and an LLM can provide extensive information about an individual's traits and preferences. However, open questions remain about how well today's LLMs can leverage such history to (1) internalize the user's inherent traits and preferences, (2) track how the user's profile and preferences evolve over time, and (3) generate personalized responses accordingly in new scenarios. In this work, we introduce the PersonaMem benchmark. PersonaMem features curated user profiles with over 180 simulated user-LLM interaction histories, each containing up to 60 sessions of multi-turn conversations across 15 real-world tasks. We observe that current LLMs still struggle to deliver responses that align with users' current situations and preferences, with frontier models such as GPT-4.1, GPT-4.5, o4-mini, and Gemini-2.0 achieving only around 50% overall accuracy.
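To make the benchmark's structure concrete, here is a minimal sketch of how a PersonaMem-style interaction history might be represented: a persona's history holds multiple sessions, each a multi-turn conversation tied to one of the task types, and the full history can be flattened into a single long-context prompt. All names below (Turn, Session, InteractionHistory, as_context) are hypothetical illustrations under assumptions from the abstract, not PersonaMem's actual data format or API.

```python
from dataclasses import dataclass, field

@dataclass
class Turn:
    role: str   # "user" or "assistant"
    text: str

@dataclass
class Session:
    task: str   # e.g., one of the 15 real-world task types
    turns: list[Turn] = field(default_factory=list)

@dataclass
class InteractionHistory:
    persona_id: str
    sessions: list[Session] = field(default_factory=list)  # up to 60 per history

    def as_context(self) -> str:
        """Flatten the full history into one long-context prompt string."""
        return "\n".join(
            f"[{s.task}] {t.role}: {t.text}"
            for s in self.sessions
            for t in s.turns
        )

# Usage: append a two-turn session to a simulated user's history.
history = InteractionHistory(persona_id="user-001")
history.sessions.append(Session(
    task="travel planning",
    turns=[
        Turn("user", "I recently switched to a vegetarian diet."),
        Turn("assistant", "Noted! I'll suggest vegetarian-friendly options."),
    ],
))
print(history.as_context())
```

Flattening sessions into one long context mirrors the evaluation setting described above, where a model must internalize and track preferences scattered across many sessions before answering in a new scenario.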
Submission Number: 143