Keywords: preference datasets, pluralistic alignment, algorithmic monoculture, human feedback
TL;DR: The largest and most representative open-source preference dataset to date, built on a new candidate response sampling strategy that improves the learning of heterogeneous preferences
Abstract: How can large language models (LLMs) serve users with varying preferences that may conflict across cultural, political, or other dimensions? To address this challenge, this paper establishes four key results. First, we demonstrate, through a large-scale multilingual human study with representative samples from five countries (N=15,000), that humans exhibit substantially more variation in preferences than the responses of 21 state-of-the-art LLMs. Second, we show that existing methods for preference dataset collection are insufficient for learning the diversity of human preferences, even along two of the most salient dimensions of variability in global values, due to the underlying homogeneity of candidate responses. Third, we argue that this motivates the need for _negatively-correlated sampling_ when generating candidate sets, and we show that simple prompt-based techniques for doing so greatly enhance the performance of alignment methods in learning heterogeneous preferences. Fourth, based on this novel candidate sampling approach, we collect and open-source _Community Alignment_, the largest and most representative multilingual and multi-turn preference dataset to date, featuring 233,319 comparisons from annotators spanning five countries. The dataset is available at https://huggingface.co/datasets/facebook/community-alignment-dataset. Overall, we hope that the Community Alignment dataset will be a valuable resource for improving the effectiveness of LLMs for a diverse global population.
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 1025