Keywords: personalization, reward models, LLM-as-a-judge
TL;DR: We explore training personalized reward models conditioned on implicit preferences expressed in long-context usage data, and contribute evaluation benchmarks, synthetic training data, and reward models.
Abstract: Reward models are widely used as a proxy for human preferences during the alignment of Large Language Models (LLMs). However, preferences are subjective and vary widely across users, motivating increased research on LLM personalization. Existing work on reward modeling for personalized generation remains limited, typically requiring *explicit*, pre-defined preferences and focusing mainly on *English responses*. Addressing these gaps, we establish benchmarks for multilingual Personalized Reward Models (PRMs) that identify user-preferred responses from unstructured user data containing *implicit* preferences. We introduce a novel framework for creating synthetic personalized reward modeling data at scale, and then evaluate PRMs on three multilingual text generation tasks. Our results show that small, fine-tuned open-source PRMs achieve performance comparable to or better than LLM-as-a-judge baselines. Even state-of-the-art proprietary reasoning LLMs achieve only 72% binary classification accuracy on our dataset, highlighting the complexity of our task. We conclude with experiments on PRM-Bench, a human-annotated user-preference benchmark, validating our models and synthetic data generation pipelines.
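As an illustration of the evaluation setup mentioned in the abstract, the sketch below shows what binary classification accuracy for a PRM typically amounts to: score both candidate responses conditioned on the user's context and count how often the user-preferred one receives the higher reward. This is a hedged sketch under assumptions, not the authors' code; the checkpoint name and the `context`/`chosen`/`rejected` fields are hypothetical.

```python
# Minimal sketch of binary preference accuracy for a personalized reward model.
# Assumptions (not from the paper): a sequence-classification reward model with a
# single scalar head, and preference pairs with "context", "chosen", "rejected" fields.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_NAME = "my-org/personalized-reward-model"  # hypothetical checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=1)
model.eval()

def reward(user_context: str, response: str) -> float:
    """Score a response conditioned on the user's (implicit-preference) context."""
    inputs = tokenizer(user_context, response, truncation=True, return_tensors="pt")
    with torch.no_grad():
        return model(**inputs).logits.squeeze().item()

def binary_accuracy(pairs) -> float:
    """Fraction of pairs where the user-preferred response gets the higher reward."""
    correct = sum(
        reward(p["context"], p["chosen"]) > reward(p["context"], p["rejected"])
        for p in pairs
    )
    return correct / len(pairs)
```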
Primary Area: datasets and benchmarks
Submission Number: 9778