Keywords: Large Language Models, Preference Alignment, Inference-Time Method
Abstract: Aligning large language models with human preferences is critical for creating
reliable and controllable AI systems. A human preference can be visualized as a
high-dimensional vector where different directions represent trade-offs between
desired attributes (e.g., helpfulness vs. verbosity). Yet, because the training data
often reflects dominant, average preferences, LLMs tend to perform well on common
requests but fall short on specific, individual needs. This mismatch creates
a preference coverage gap. Existing methods often address this through costly
retraining, which may not generalize to the full spectrum of diverse preferences.
This brittleness means that when a user’s request reflects a nuanced preference
deviating from the training data’s central tendency, model performance can degrade
unpredictably. To address this challenge, we introduce Robust Preference Selection
(RPS), a post-hoc, training-free method that leverages directional neighborhood
consensus. Instead of forcing a model to generate a response from a single, highly
specific preference, RPS samples multiple responses from a local neighborhood
of related preferences to create a superior candidate pool. It then selects the re-
sponse that best aligns with the user’s original intent. We provide a theoretical
framework showing that, under mild conditions where (i) nearby preference direc-
tions correspond to better-trained regions of the model and (ii) the reward-model
scores change smoothly with small angular changes in the preference vector, our
neighborhood generation strategy yields a higher expected best score than a strong
baseline that also samples multiple candidates. Comprehensive experiments across
three distinct alignment paradigms (DPA, DPO, and SFT) demonstrate that RPS
consistently improves robustness against this baseline, achieving win rates of up
to 69% on challenging preferences from under-represented regions of the space
without any model retraining. Our work presents a practical, theoretically-grounded
solution for enhancing the reliability of preference-aligned models.
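The selection procedure the abstract describes can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: `generate` and `reward` are hypothetical stand-ins for the preference-conditioned LLM and the reward model, the 2-D preference vectors and the 15° sampling cone are illustrative assumptions, and the neighbor count `k` is arbitrary.

```python
import math
import random

def sample_neighbors(pref, k, max_angle_deg=15.0, rng=random):
    """Sample k unit vectors within a small angular cone around `pref`.

    Illustrative 2-D version: perturb the preference direction by a
    random angle and renormalize. In higher dimensions one would add
    Gaussian noise and project back onto the unit sphere.
    """
    base = math.atan2(pref[1], pref[0])
    neighbors = []
    for _ in range(k):
        theta = base + math.radians(rng.uniform(-max_angle_deg, max_angle_deg))
        neighbors.append((math.cos(theta), math.sin(theta)))
    return neighbors

def robust_preference_selection(prompt, pref, generate, reward, k=4):
    """RPS sketch: generate one candidate per nearby preference direction
    (plus the original), then return the candidate that the reward model
    scores highest under the user's *original* preference."""
    directions = [pref] + sample_neighbors(pref, k)
    candidates = [generate(prompt, d) for d in directions]
    return max(candidates, key=lambda resp: reward(prompt, resp, pref))
```

The key design point is that candidates are sampled from neighboring directions, but selection is always scored against the original preference, so the enlarged pool can only help alignment with the user's intent.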
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 17218