Know Me, Respond to Me: Benchmarking LLMs for Dynamic User Profiling and Personalized Responses at Scale

Published: 24 Sept 2025 · Last Modified: 24 Sept 2025 · NeurIPS 2025 LLM Evaluation Workshop Poster · License: CC BY 4.0
Keywords: personalization, chatbot, large language models, memory, long context
TL;DR: We introduce a benchmark for evaluating the personalization abilities of large language models over long contexts.
Abstract: Large Language Models (LLMs) have emerged as personalized assistants for users across a wide range of tasks. Over time, the interaction history between a user and an LLM can provide extensive information about an individual's traits and preferences. However, open questions remain about how well today's LLMs can leverage such history to (1) internalize the user's inherent traits and preferences, (2) track how the user's profile and preferences evolve over time, and (3) generate personalized responses accordingly in new scenarios. In this work, we introduce the PersonaMem benchmark. PersonaMem features curated user profiles with over 180 simulated user-LLM interaction histories, each containing up to 60 sessions of multi-turn conversations across 15 real-world tasks. We observe that current LLMs still struggle to deliver responses that align with users' current situations and preferences, with frontier models such as GPT-4.1, GPT-4.5, o4-mini, and Gemini-2.0 achieving only around 50% overall accuracy.
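To make the benchmark's structure concrete, here is a minimal sketch of how a PersonaMem-style interaction history might be represented: a persona's history holds multiple sessions, each a multi-turn conversation tied to one of the task types, and the full history can be flattened into a single long-context prompt. All names below (Turn, Session, InteractionHistory, as_context) are hypothetical illustrations under assumptions from the abstract, not PersonaMem's actual data format or API.

```python
from dataclasses import dataclass, field

@dataclass
class Turn:
    role: str   # "user" or "assistant"
    text: str

@dataclass
class Session:
    task: str   # e.g., one of the 15 real-world task types
    turns: list[Turn] = field(default_factory=list)

@dataclass
class InteractionHistory:
    persona_id: str
    sessions: list[Session] = field(default_factory=list)  # up to 60 per history

    def as_context(self) -> str:
        """Flatten the full history into one long-context prompt string."""
        return "\n".join(
            f"[{s.task}] {t.role}: {t.text}"
            for s in self.sessions
            for t in s.turns
        )

# Usage: append a two-turn session to a simulated user's history.
history = InteractionHistory(persona_id="user-001")
history.sessions.append(Session(
    task="travel planning",
    turns=[
        Turn("user", "I recently switched to a vegetarian diet."),
        Turn("assistant", "Noted! I'll suggest vegetarian-friendly options."),
    ],
))
print(history.as_context())
```

Flattening sessions into one long context mirrors the evaluation setting described above, where a model must internalize and track preferences scattered across many sessions before answering in a new scenario.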
Submission Number: 143