Abstract: Personalized alignment to individual users has been a long-standing goal in large language models (LLMs). We introduce Drift, a novel framework that personalizes LLMs at decoding time using implicit user preferences. Unlike traditional Reinforcement Learning from Human Feedback (RLHF), which relies on vast annotated datasets and expensive gradient updates, Drift operates in a training-free manner by steering a frozen LLM through few-shot preference modeling. Our approach represents user preferences as a composition of interpretable, predefined attributes and employs a zero-shot rewarding mechanism based on contrastive system prompts. Experiments on both a synthetic persona dataset (Perspective) and a real human-annotated dataset (PRISM) demonstrate that Drift achieves performance comparable to standard RLHF methods while using only 50–100 examples. Our results show that Drift delivers personalization that is not only computationally efficient but also interpretable.
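The abstract describes decoding-time steering of a frozen LLM via contrastive system prompts used as a zero-shot reward. Below is a minimal, hedged sketch of that general idea, not the paper's reference implementation: the per-token log-probability gap between a "preferred-attribute" system prompt and its contrastive counterpart is added to the base model's distribution during greedy decoding. The model name, prompt texts, helper names (`next_token_logits`, `generate_steered`), and the weight `alpha` are illustrative assumptions; in the paper the preference is modeled few-shot as a composition of attributes rather than a single hand-set weight.

```python
# Sketch (assumptions labeled above): contrastive-system-prompt steering at decoding time.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # placeholder; any causal LM would do
tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

POS_SYS = "Respond in a concise, formal style."    # attribute the user prefers (hypothetical)
NEG_SYS = "Respond in a verbose, informal style."  # contrastive counterpart (hypothetical)

@torch.no_grad()
def next_token_logits(system_prompt: str, user_text: str, generated: str) -> torch.Tensor:
    """Logits for the next token given a system prompt, the user input, and the partial reply."""
    ids = tok(system_prompt + "\n" + user_text + "\n" + generated, return_tensors="pt").input_ids
    return model(ids).logits[0, -1]

@torch.no_grad()
def generate_steered(user_text: str, max_new_tokens: int = 40, alpha: float = 1.0) -> str:
    """Greedy decoding where the contrastive log-prob gap acts as a zero-shot per-token reward."""
    generated = ""
    for _ in range(max_new_tokens):
        base = torch.log_softmax(next_token_logits("", user_text, generated), dim=-1)
        pos = torch.log_softmax(next_token_logits(POS_SYS, user_text, generated), dim=-1)
        neg = torch.log_softmax(next_token_logits(NEG_SYS, user_text, generated), dim=-1)
        steered = base + alpha * (pos - neg)  # steer the frozen model without gradient updates
        next_id = int(torch.argmax(steered))
        if next_id == tok.eos_token_id:
            break
        generated += tok.decode([next_id])
    return generated

print(generate_steered("Explain what decoding-time personalization means."))
```

In this sketch a single attribute pair and scalar `alpha` stand in for the paper's few-shot preference model; extending it to a weighted composition of several attribute prompts (with weights fit from 50–100 preference examples) would follow the same pattern of combining log-probability gaps at each decoding step.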
Paper Type: Long
Research Area: Language Modeling
Research Area Keywords: personalized alignments, few-shot personalization, efficient preference modeling, decoding-time alignments
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Approaches to low-resource settings, Data resources, Theory
Languages Studied: English
Submission Number: 902