Unpacking Human Preference for LLMs: Demographically Aware Evaluation with the Diverse Framework

ICLR 2026 Conference Submission 19192 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Leaderboards, LLM, Evaluation, Benchmarking
TL;DR: We introduce DIVERSE, a new evaluation framework for LLMs that uses demographically stratified sampling and multi-turn conversations to reveal significant performance differences across user demographics.
Abstract: Current evaluation of large language models relies predominantly on technical benchmarks that fail to capture how users actually experience these systems in practice. Even the most prominent human-preference evaluation approaches suffer from methodological limitations, including unrepresentative sampling, superficial assessment depth, and single-metric reductionism that obscures the multidimensional nature of human-AI interaction quality. We introduce DIVERSE, a rigorous evaluation framework that addresses these limitations through demographically stratified sampling, multi-turn naturalistic conversations, and assessment across five human-centric dimensions. We collected conversations from 21,352 participants stratified across 22 demographic groups in the US and UK, evaluating 27 state-of-the-art language models through pairwise comparisons. Using a robust hierarchical Bradley-Terry-Davidson model alongside post-stratified demographic adjustments to census weights, we reveal insights unavailable to existing approaches: (1) clear performance hierarchies, with Gemini-2.5-Pro achieving a 97% probability of ranking first for overall preference, (2) quantification of significant preference heterogeneity, identifying user age as the primary factor and revealing failures in model generalization across populations, and (3) differential discriminative power across human-centric evaluation dimensions, with Trust, Ethics & Safety showing significantly higher tie rates than task-performance metrics. Our framework demonstrates that meaningful evaluation requires moving beyond aggregate preference scores to understand the complex, demographic-specific patterns that determine real-world model preference. We release our complete dataset, interactive leaderboard, and evaluation framework to catalyse further research into more rigorous and equitable evaluation of language models.
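To make the ranking methodology in the abstract concrete, the sketch below shows the standard Davidson (1970) tie-extended Bradley-Terry probabilities and a simple post-stratification reweighting to census proportions. This is a minimal illustration under assumptions, not the authors' hierarchical implementation; all function names, scores, and weights here are hypothetical.

```python
# Illustrative sketch only: not the paper's hierarchical Bradley-Terry-Davidson model.
# It shows (a) Davidson's tie-extended pairwise probabilities and (b) post-stratified
# reweighting of per-group estimates to census proportions. All values are made up.
import numpy as np


def btd_probabilities(score_a: float, score_b: float, nu: float):
    """Win/tie/loss probabilities for model A vs. model B.

    score_a, score_b: latent strengths on the log scale.
    nu: tie parameter (nu >= 0); nu = 0 recovers plain Bradley-Terry.
    """
    pi_a, pi_b = np.exp(score_a), np.exp(score_b)
    tie_mass = nu * np.sqrt(pi_a * pi_b)
    denom = pi_a + pi_b + tie_mass
    return pi_a / denom, tie_mass / denom, pi_b / denom  # P(A wins), P(tie), P(B wins)


def poststratified_mean(group_means: dict, census_weights: dict) -> float:
    """Reweight per-demographic-group preference estimates to census proportions."""
    total = sum(census_weights.values())
    return sum(group_means[g] * census_weights[g] / total for g in group_means)


if __name__ == "__main__":
    # Two hypothetical models with latent scores 1.2 and 0.8, tie parameter 0.5.
    p_win, p_tie, p_loss = btd_probabilities(1.2, 0.8, nu=0.5)
    print(f"P(A wins)={p_win:.3f}  P(tie)={p_tie:.3f}  P(B wins)={p_loss:.3f}")

    # Post-stratify hypothetical per-age-group win rates to made-up census shares.
    print(poststratified_mean({"18-29": 0.62, "30-49": 0.55, "50+": 0.48},
                              {"18-29": 0.21, "30-49": 0.33, "50+": 0.46}))
```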
Supplementary Material: zip
Primary Area: datasets and benchmarks
Submission Number: 19192