What Makes a Good Response? Learning Personal Preferences from Interpretable Features

Published: 22 Mar 2026, Last Modified: 08 May 2026 | ICNLP 2026 | Everyone | arXiv.org perpetual, non-exclusive license
Abstract: Aligning large language models (LLMs) with human preferences typically assumes a single, universal reward function learned from large-scale annotations. However, users have diverse and sometimes conflicting preferences, and personalizing to individual users requires methods that can learn reward functions from very few examples. We present a framework for interpretable reward modeling that extracts human-understandable features from LLM responses and learns lightweight reward functions over these features. Our approach decouples what features matter (transferable across users) from how much each feature matters (personalizable per user), enabling effective few-shot adaptation to new users. We analyze which features drive human preferences, finding that response detail and organization are consistently important. We further investigate when prediction fails, showing that model confidence is well-calibrated to task difficulty. Finally, we extend our framework to multi-turn conversations, discovering that users exhibit a “weakest-link” behavior where conversation quality is judged by the worst individual response. Our interpretable, data-efficient approach provides a foundation for personalizable LLM alignment.
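The abstract does not specify the form of the reward function, so the following is only an illustrative sketch of the general idea, not the authors' method. It assumes a linear reward over a shared feature vector, per-user weights fit from a handful of pairwise preferences with a Bradley-Terry (logistic) loss, and a minimum over turns as one possible reading of the "weakest-link" finding. All function names, hyperparameters, and feature values are hypothetical.

```python
import math

def reward(weights, features):
    """Linear reward: weighted sum of interpretable feature scores."""
    return sum(w * f for w, f in zip(weights, features))

def fit_user_weights(pairs, n_features, lr=0.5, steps=200, l2=0.01):
    """Fit per-user weights from (chosen, rejected) feature-vector pairs.

    Assumption: Bradley-Terry model, P(chosen > rejected) = sigmoid(r_c - r_r),
    trained by gradient descent with L2 regularization.
    """
    w = [0.0] * n_features
    for _ in range(steps):
        grad = [l2 * wi for wi in w]  # gradient of the L2 penalty
        for chosen, rejected in pairs:
            diff = [c - r for c, r in zip(chosen, rejected)]
            p = 1.0 / (1.0 + math.exp(-reward(w, diff)))
            for i, d in enumerate(diff):
                # gradient of -log(sigmoid(w . diff)), averaged over pairs
                grad[i] -= (1.0 - p) * d / len(pairs)
        w = [wi - lr * gi for wi, gi in zip(w, grad)]
    return w

def conversation_score(weights, turn_features):
    """Assumed weakest-link aggregation: score of the worst single turn."""
    return min(reward(weights, f) for f in turn_features)

# Hypothetical feature order: [detail, organization]; values are made up.
few_shot_pairs = [
    ([0.9, 0.2], [0.3, 0.8]),
    ([0.8, 0.5], [0.2, 0.5]),
    ([0.7, 0.1], [0.4, 0.9]),
]
user_weights = fit_user_weights(few_shot_pairs, n_features=2)
turns = [[0.9, 0.9], [0.1, 0.2], [0.8, 0.7]]
print(user_weights, conversation_score(user_weights, turns))
```

In this sketch the feature definitions would be shared across users, while only the small weight vector is refit for each new user, which is what makes adaptation from a few comparisons plausible.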