LikeBench: Evaluating Subjective Likability in LLMs for Personalization

ACL ARR 2026 January Submission 2413 Authors

02 Jan 2026 (modified: 20 Mar 2026) · CC BY 4.0
Keywords: likability, personalization, llm, adaptability, memory, benchmark
Abstract: A personalized LLM should remember user facts, apply them correctly, and adapt over time to provide responses that the user prefers. Existing LLM personalization benchmarks are largely centered on two axes: accurately recalling user information and accurately applying remembered information in downstream tasks. We argue that a third axis, likability, is both subjective and central to user experience, yet under-measured by current benchmarks. To measure likability holistically, we introduce LikeBench, a multi-session, dynamic evaluation framework that measures likability across multiple dimensions by how much an LLM can adapt over time to a user’s preferences to provide more likable responses. In LikeBench, the LLMs engage in conversation with a simulated user and learn preferences only from the ongoing dialogue. As the interaction unfolds, models try to adapt to responses, and after each turn, they are evaluated for likability across seven dimensions by the same simulated user. To the best of our knowledge, we are the first to decompose likability into multiple diagnostic metrics, which makes it easier to pinpoint where a model falls short. To improve realism and discriminativeness, LikeBench uses fine-grained, psychologically grounded personas instead of coarse high/low trait ratings used in prior work. Our benchmark shows that strong memory performance does not guarantee high likability: DeepSeek R1, with lower memory accuracy (86\%, 17 facts/profile), outperformed Qwen3 by 28\% on likability score despite Qwen3’s higher memory accuracy (93\%, 43 facts/profile). Even SOTA models like GPT-5 adapt well in short exchanges but show only limited robustness in longer, noisier interactions.
Paper Type: Long
Research Area: Human-AI Interaction/Cooperation and Human-Centric NLP
Research Area Keywords: human-AI interaction/cooperation, human-centered evaluation, user-centered design, evaluation and metrics, conversational modeling, prompting, benchmarking, evaluation methodologies
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Data resources, Data analysis
Languages Studied: English
Submission Number: 2413