PersonalBench: A Human-Grounded Benchmark for LLM Personalization

Lechen Zhang; Tal August

Published: 28 Apr 2026, Last Modified: 28 Apr 2026, MSLD 2026 Poster, CC BY 4.0

Keywords: large language models, personalization, benchmark, user modeling

Abstract: Large language models (LLMs) are increasingly deployed as personal assistants, yet their ability to produce genuinely user-aligned responses remains unclear. Existing personalization benchmarks largely rely on synthetic conversations and fully LLM-dependent evaluation protocols, assuming that models can both simulate realistic user preferences and judge personalization quality. However, whether LLM-generated personalization captures meaningful, user-specific behaviors beyond surface-level traits is an open question. To address this gap, we introduce PersonalBench, a benchmark grounded in real user behavior and constructed from WildChat conversation histories. We develop an automated pipeline that extracts persistent user attributes from these histories and pairs each user with diverse prompts to identify which attributes are meaningfully relevant to specific user-prompt contexts. Through this process, we uncover a critical bottleneck: selecting contextually relevant attributes is surprisingly difficult for LLMs, with model selections showing poor agreement with those of real users. This finding motivates us to incorporate human annotators into the loop, revealing a fundamental gap between superficial lexical personalization and personalization that users genuinely find comfortable and appropriate. Our benchmark is designed to diagnose where in the personalization pipeline models fail, covering attribute extraction, relevance selection, and response generation. Evaluations on state-of-the-art models will quantify failure rates at each stage and identify systematic patterns, such as over-retrieval of irrelevant attributes and under-adaptation in emotionally sensitive contexts, to guide future model development.
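
The abstract reports poor agreement between model-selected and user-selected attributes but does not say how agreement is measured. As a rough illustration only, the sketch below computes two common set-agreement measures, Jaccard overlap and Cohen's kappa, for a single user-prompt pair. The choice of metrics, the function names, and the example attributes are all assumptions made for illustration and are not taken from the PersonalBench pipeline.

```python
from typing import Iterable, Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard overlap of two selections; two empty selections count as full agreement."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def cohen_kappa(candidates: Iterable[str], model_sel: Set[str], user_sel: Set[str]) -> float:
    """Cohen's kappa over binary relevant/irrelevant labels, one label per candidate attribute."""
    items = list(candidates)
    n = len(items)
    if n == 0:
        raise ValueError("need at least one candidate attribute")
    # Observed agreement: fraction of attributes given the same label by both raters.
    p_o = sum(1 for x in items if (x in model_sel) == (x in user_sel)) / n
    # Chance agreement, from each rater's marginal rate of marking an attribute relevant.
    p_m = sum(1 for x in items if x in model_sel) / n
    p_u = sum(1 for x in items if x in user_sel) / n
    p_e = p_m * p_u + (1 - p_m) * (1 - p_u)
    if p_e == 1.0:
        return 1.0  # both raters gave every attribute the same constant label
    return (p_o - p_e) / (1 - p_e)


# Hypothetical attributes for one user-prompt pair (invented for this example).
candidates = ["vegetarian", "lives in Berlin", "software engineer", "has a dog"]
model_selection = {"vegetarian", "lives in Berlin", "software engineer"}
user_selection = {"vegetarian"}

print(jaccard(model_selection, user_selection))
print(cohen_kappa(candidates, model_selection, user_selection))
```

In this invented case the model marks three attributes as relevant while the user marks one, which is the kind of over-selection the abstract describes; both measures come out well below 1.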

Email Sharing: We authorize the sharing of all author emails with Program Chairs.

Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.

Submission Number: 174
