Steerable Pluralism: Pluralistic Alignment via Few-Shot Comparative Regression

ACL ARR 2025 February Submission 2293 Authors

14 Feb 2025 (modified: 09 May 2025) · ACL ARR 2025 February Submission · License: CC BY 4.0
Abstract: Large language models (LLMs) are currently aligned using techniques such as reinforcement learning from human feedback (RLHF). However, these methods use scalar rewards that can only reflect user preferences on average. Pluralistic alignment instead seeks to capture diverse user preferences across a set of attributes, moving beyond just helpfulness and harmlessness. To this end, we propose a steerable pluralistic model based on few-shot comparative regression that can adapt to individual user preferences. Our approach leverages in-context learning and reasoning, grounded in a set of fine-grained attributes, to compare response options and make aligned choices. To evaluate our algorithm, we also propose two new steerable pluralistic benchmarks based on the Moral Integrity Corpus (MIC) and HelpSteer2 datasets, demonstrating the applicability of our approach to value-aligned decision-making and reward modeling, respectively. Our few-shot comparative regression approach is interpretable and compatible with different attributes and LLMs, and it outperforms multiple baseline and state-of-the-art methods. Our work provides new insights and research directions in pluralistic alignment, enabling more equitable and representative use of LLMs and advancing the state of the art in ethical AI. Our benchmarks and code will be made publicly available at: <redacted>.
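The abstract's core idea, comparing candidate responses via in-context learning grounded in fine-grained attributes, can be illustrated with a minimal Python sketch. Everything below is an illustrative assumption, not the authors' implementation: the attribute list, function names, and prompt wording are hypothetical, and the LLM call is left as an abstract callable.

    from typing import Callable, Sequence

    # Hypothetical fine-grained attributes; the paper grounds comparisons in
    # attributes beyond helpfulness and harmlessness, but the exact set used
    # in the work is not given here and is assumed for illustration.
    ATTRIBUTES = ["helpfulness", "harmlessness", "honesty", "empathy"]

    def build_comparison_prompt(
        few_shot_examples: Sequence[tuple[str, str, str, str]],
        query: str,
        response_a: str,
        response_b: str,
    ) -> str:
        """Assemble an in-context prompt that shows a user's past pairwise
        choices, then asks the model to compare two new responses attribute
        by attribute and pick the one aligned with those preferences.

        Each few-shot example is (query, response_a, response_b, preferred),
        with preferred being "A" or "B".
        """
        lines = [
            "Compare two candidate responses along these attributes: "
            + ", ".join(ATTRIBUTES) + ".",
            "Choose the response that best matches this user's preferences.",
            "",
        ]
        for q, a, b, preferred in few_shot_examples:
            lines += [f"Query: {q}", f"Response A: {a}", f"Response B: {b}",
                      f"Preferred: {preferred}", ""]
        lines += [f"Query: {query}", f"Response A: {response_a}",
                  f"Response B: {response_b}",
                  "Reason over each attribute, then answer 'A' or 'B'.",
                  "Preferred:"]
        return "\n".join(lines)

    def choose_response(llm: Callable[[str], str], prompt: str) -> str:
        """llm is any text-completion callable; read off the final choice."""
        answer = llm(prompt).strip().upper()
        return "A" if answer.startswith("A") else "B"

    if __name__ == "__main__":
        demo = [("Should I lie to spare a friend's feelings?",
                 "Yes, small lies are kind.", "No, gentle honesty is better.",
                 "B")]
        mock_llm = lambda p: "B"  # stand-in for a real model call
        prompt = build_comparison_prompt(
            demo, "Should I report a coworker's mistake?",
            "Stay quiet.", "Raise it privately first.")
        print(choose_response(mock_llm, prompt))

Steering here comes entirely from the few-shot examples: swapping in a different user's labeled comparisons changes the choices without retraining, which is the adaptivity the abstract claims for the approach.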
Paper Type: Long
Research Area: Human-Centered NLP
Research Area Keywords: user-centered design, value-centered design, human factors in NLP, values and culture
Contribution Types: Model analysis & interpretability, Publicly available software and/or pre-trained models, Data analysis
Languages Studied: English
Submission Number: 2293