Keywords: LLM/AI agents, benchmarking, prompting
Abstract: Large Language Models have advanced autonomous agents, but personalization remains essential for agents to be practically useful.
To measure this capability, recent benchmarks aim to evaluate personalization in agents. However, they either provide static preference snapshots or fixed interaction logs, or they evaluate personalization mainly through question answering over retrieved profiles. These designs under-represent the complexity of real preferences in dialogue histories and fail to assess preference-conditioned task execution, thereby obscuring a critical knowing-doing gap. To address this, we introduce PersonaKAG, a benchmark for implicit behavioral alignment built from longitudinal interaction histories that contain noise, implicit cues, and temporal inconsistencies. PersonaKAG evaluates whether an agent can execute tasks while satisfying implicit constraints inferred from its history, rather than merely answering preference questions.
We further propose SynRPG, a framework that combines broad retrieval with trajectory-level alignment to resolve conflicting priorities over time. Results on PersonaKAG suggest that effective personalization is still challenging for state-of-the-art LLM agents.
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: benchmarking, corpus creation, automatic evaluation of datasets
Languages Studied: English
Submission Number: 5427