Keywords: Large language model, Preference alignment, Pluralistic alignment, System message, Instruction tuning
TL;DR: Proposes individualized and scalable alignment of LLMs by verbalizing values in the system message, flexibly steering the model toward personalized responses
Abstract: Current large language model (LLM) alignment methods often assume that aligning LLMs with general public preferences is optimal, overlooking individual value diversity. A major challenge in adopting a more individualized approach to LLM alignment is its lack of scalability, as it requires re-training a new model for each new value or user. We propose a new paradigm in which users specify their values within the system message, steering LLM behavior to align with individual intentions. However, LLMs are typically trained on a generic system message (e.g., "You are a helpful assistant"). To improve generalization to diverse system messages, we create a system message dataset with 197k value combinations across 66k user instructions. We train a 7B LLM, Janus, and test it on five benchmarks, adding various unseen system messages that reflect user preferences. Janus achieves high tie+win rates against leading models, including GPT-4. Janus also outperforms LLaMA 3 8B Instruct on general helpfulness benchmarks, suggesting that training with diverse system messages enhances alignment with both individual and general preferences. Code, dataset, benchmark, and models are available at https://github.com/kaistAI/Janus.
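To illustrate the proposed paradigm, here is a minimal sketch of querying a model like Janus with user values verbalized in the system message rather than a generic assistant prompt. The model identifier "kaist-ai/janus-7b", the chat-template usage, and the example value description are assumptions for illustration, not details confirmed by the abstract; see the GitHub repository for the released checkpoints.

```python
# Minimal sketch: steering responses via a value-verbalizing system message.
# Assumptions: the released 7B checkpoint loads with Hugging Face transformers
# and ships a standard chat template; the model id below is hypothetical.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "kaist-ai/janus-7b"  # hypothetical identifier; check the repo for the actual one
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# The user's values go directly into the system message instead of
# a generic "You are a helpful assistant" prompt.
messages = [
    {
        "role": "system",
        "content": (
            "You are an assistant for a busy parent who values concise, "
            "practical answers, avoids jargon, and prefers budget-friendly suggestions."
        ),
    },
    {"role": "user", "content": "Plan a weeknight dinner for a family of four."},
]

# Build the prompt with the model's chat template and generate a personalized response.
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```

Changing only the system message (e.g., to favor detailed, technical explanations) would, under this paradigm, steer the same model toward a different user's preferences without any re-training.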
Submission Number: 73