Pluralistic Preference Alignment via Sortition-Weighted RLHF

Published: 02 Jun 2026, Last Modified: 02 Jun 2026, Pluralistic-Alignment 2026, CC BY 4.0
Keywords: AI Alignment, Reinforcement Learning from Human Feedback, RLHF, Pluralistic Alignment, Sortition, Preference Optimization, Fairness, Social Choice, Bradley Terry
Abstract: Preference-based alignment methods such as RLHF often depend on convenience rater pools that are demographically skewed, potentially encoding the values of some groups over others. We study sortition-weighted preference learning: a representativeness-aware approach that brings algorithmic sortition, the mechanism used to form citizens' assemblies, into preference-based fine-tuning. The approach supports two schemes. Hard Panel trains only on preferences from a single quota-satisfying mini-public sampled by sortition, while Soft Panel retains all data but reweights each rater by their inclusion probability under the same lottery. Using PRISM rater demographics and preferences, we train Llama variants with DPO and evaluate them against a 75-clause constitution elicited from a representative U.S. panel. Across multiple aggregation families, Hard Panel ranks highest and Soft Panel improves over the Full PRISM baseline. We further test weighting functions, panel sizes, gradient diagnostics, and a second preference dataset, finding that sortition-based correction is useful but dataset-dependent: Hard Panel gives the strongest PRISM result, while Soft Panel is more competitive when hard filtering discards much higher-quality data. These results support a pluralistic, population-specific view of alignment: when a system is meant to serve a target public, the composition of the feedback signal can be made explicit, auditable, and empirically consequential.
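The two schemes in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes a simple stratified lottery (draw a fixed quota uniformly at random from each demographic group) in place of whatever sortition algorithm the paper uses, and it assumes the Soft Panel weight is the raw inclusion probability. The names `stratified_lottery`, `inclusion_probabilities`, `hard_panel_loss`, and `soft_panel_loss`, along with the toy raters and margins, are hypothetical. Each `margin` stands for the usual DPO quantity: the policy-versus-reference log-ratio of the chosen response minus that of the rejected response.

```python
import math
import random
from collections import defaultdict


def stratified_lottery(raters, quotas, rng):
    """Draw one panel: quotas[g] raters uniformly at random from each group g.

    Assumption: a plain stratified lottery stands in for the paper's sortition step.
    """
    by_group = defaultdict(list)
    for rater_id, group in raters.items():
        by_group[group].append(rater_id)
    panel = set()
    for group, k in quotas.items():
        panel.update(rng.sample(sorted(by_group[group]), k))
    return panel


def inclusion_probabilities(raters, quotas):
    """Under the lottery above, a rater in group g is selected with prob quota_g / n_g."""
    counts = defaultdict(int)
    for group in raters.values():
        counts[group] += 1
    return {r: quotas.get(g, 0) / counts[g] for r, g in raters.items()}


def dpo_loss(margin, beta=0.1):
    """Per-example DPO loss, -log(sigmoid(beta * margin)), computed stably."""
    x = beta * margin
    return math.log1p(math.exp(-x)) if x > 0 else -x + math.log1p(math.exp(x))


def hard_panel_loss(examples, panel, beta=0.1):
    """Hard Panel: average the loss over preferences from panel members only."""
    kept = [m for rater, m in examples if rater in panel]
    return sum(dpo_loss(m, beta) for m in kept) / len(kept)


def soft_panel_loss(examples, probs, beta=0.1):
    """Soft Panel: keep all preferences, weight each by its rater's inclusion probability."""
    total_w = sum(probs[r] for r, _ in examples)
    return sum(probs[r] * dpo_loss(m, beta) for r, m in examples) / total_w


# Toy pool (hypothetical): group g1 is over-represented three to one.
raters = {"a": "g1", "b": "g1", "c": "g1", "d": "g2"}
quotas = {"g1": 1, "g2": 1}
# (rater_id, DPO margin) pairs; the margins are made-up numbers.
examples = [("a", 2.0), ("b", 1.0), ("c", -1.0), ("d", 0.5)]

panel = stratified_lottery(raters, quotas, random.Random(0))
probs = inclusion_probabilities(raters, quotas)
hard = hard_panel_loss(examples, panel)
soft = soft_panel_loss(examples, probs)
```

In this toy pool each `g1` rater has inclusion probability 1/3 and the single `g2` rater has probability 1, so the Soft Panel loss gives the two groups equal total weight while still using every preference, whereas the Hard Panel loss uses only the two sampled raters.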
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 40