User Preference Modeling for Conversational LLM Agents: Weak Rewards from Retrieval-Augmented Interaction

Yuren Hao; Shuhaib Mehri; ChengXiang Zhai; Dilek Hakkani-Tür

Published: 28 Apr 2026, Last Modified: 28 Apr 2026; Venue: MSLD 2026 Poster; License: CC BY 4.0

Keywords: conversational AI, LLM, personalization, user modeling

TL;DR: We propose a pipeline-agnostic method that models each user's preferences as vectors learned from weak feedback during interactions. Results show that the learned vectors represent user preferences correctly and improve retrieval.

Abstract: We present a frozen-backbone user modeling framework that represents each user as a low-dimensional dual vector (long-term and short-term) in a shared preference space, updated online from weak scalar rewards via REINFORCE—without modifying any backbone model. The framework is pipeline-agnostic: any feedback reducible to a scalar reward can drive user-vector learning. Preferences are extracted as structured condition--action rules, stored in a retrieval-augmented memory, and the user vector modulates retrieval scores to surface the most relevant preferences for each query. We evaluate on \textsc{MultiSessionCollab}, an online multi-session benchmark with LLM-simulated users who enforce rich style preferences, across three task domains (math-hard, math-500, bigcodebench) with $60$ user profiles over $60$ sessions each. Our RAG+Vector agent achieves the highest task success ($55.2\%$) among six system modes and significantly reduces interaction friction versus a Reflection baseline: timeout rate drops by $2.4$\,pp ($p = 0.046$) and user effort by $6.7\%$ ($p = 0.021$), yielding the highest interaction efficiency ($2.83$ successes per $1{,}000$ user tokens). Analysis of the learned vectors confirms that the dual-vector design induces meaningful preference geometry: long-term vectors significantly associate with cross-user preference overlap ($p = 0.006$), while short-term vectors do not ($p = 0.586$), validating the separation of stable user identity from session-specific context.
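
The mechanism described in the abstract (a dual user vector that modulates retrieval scores and is updated from a scalar reward via REINFORCE) can be illustrated with a minimal sketch. The abstract does not give the update equations, so everything specific here is an assumption made for illustration: the class name `DualUserVector`, the additive combination of the long-term and short-term vectors, the softmax retrieval policy, the learning rates, and the short-term decay factor are all hypothetical and are not taken from the paper.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


class DualUserVector:
    """Hypothetical dual user vector: a slowly updated long-term part and a
    fast, decaying short-term part (assumed design, not the authors' code)."""

    def __init__(self, dim, lr_long=0.01, lr_short=0.1, short_decay=0.9):
        self.z_long = [0.0] * dim
        self.z_short = [0.0] * dim
        self.lr_long = lr_long          # assumed small step for stable identity
        self.lr_short = lr_short        # assumed larger step for session context
        self.short_decay = short_decay  # assumed forgetting factor

    def combined(self):
        # Assumption: the two parts are simply added.
        return [a + b for a, b in zip(self.z_long, self.z_short)]

    def retrieval_probs(self, base_scores, pref_embs):
        # Base retrieval score plus a user-specific bonus, turned into a policy.
        z = self.combined()
        logits = [s + dot(e, z) for s, e in zip(base_scores, pref_embs)]
        return softmax(logits)

    def reinforce_update(self, chosen, probs, pref_embs, reward, baseline=0.0):
        # Gradient of log p(chosen) w.r.t. the user vector for a softmax policy:
        # e_chosen - sum_j p_j * e_j
        dim = len(self.z_long)
        expected = [sum(p * e[k] for p, e in zip(probs, pref_embs))
                    for k in range(dim)]
        grad = [pref_embs[chosen][k] - expected[k] for k in range(dim)]
        adv = reward - baseline
        self.z_long = [z + self.lr_long * adv * g
                       for z, g in zip(self.z_long, grad)]
        self.z_short = [self.short_decay * z + self.lr_short * adv * g
                        for z, g in zip(self.z_short, grad)]


# Toy usage: three stored preference rules with made-up embeddings.
prefs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
base = [0.0, 0.0, 0.0]
user = DualUserVector(dim=3)
probs = user.retrieval_probs(base, prefs)
chosen = random.Random(0).choices(range(3), weights=probs)[0]
user.reinforce_update(chosen, probs, prefs, reward=1.0)  # weak positive reward
```

In this sketch only the user vector changes; the embeddings and base scores stand in for the frozen backbone. A positive reward raises the retrieval probability of the surfaced preference and a negative reward lowers it.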

Submission Number: 86