Abstract: Aligning large language models (LLMs) with human
preferences typically assumes a single, universal reward function
learned from large-scale annotations. However, users have diverse
and sometimes conflicting preferences, and personalizing to
individual users requires methods that can learn reward functions
from very few examples. We present a framework for interpretable
reward modeling that extracts human-understandable
features from LLM responses and learns lightweight reward
functions over these features. Our approach decouples what
features matter (transferable across users) from how much each
feature matters (personalizable per user), enabling effective few-
shot adaptation to new users. We analyze which features drive
human preferences, finding that response detail and organization
are consistently important. We further investigate when prediction
fails, showing that model confidence is well-calibrated
to task difficulty. Finally, we extend our framework to multi-
turn conversations, discovering that users exhibit a “weakest-
link” behavior where conversation quality is judged by the worst
individual response. Our interpretable, data-efficient approach
provides a foundation for personalizable LLM alignment.