Rushes: A Human Preference Dataset for Pluralistic Alignment

ACL ARR 2026 January Submission 8745 Authors

06 Jan 2026 (modified: 20 Mar 2026) · ACL ARR 2026 January Submission · CC BY 4.0
Keywords: corpus creation, benchmarking, NLP datasets, evaluation methodologies, human-AI interaction, human-centered evaluation, interactive storytelling
Abstract: We introduce $\textbf{Rushes}$, a dataset and benchmark for studying revealed human engagement preferences in interactive narrative environments. Rushes is collected through a game interface where users interact with AI-generated branching narratives and select one choice from a small, explicit candidate set at each decision point. Each interaction logs the full candidate set, the user's choice, and the evolving narrative context, yielding time-ordered trajectories with persistent user-level identifiers. Rushes contains 44,226 decision events from 8,167 unique users across six games, capturing sequential, personalized engagement behavior rather than static judgments. We show that user choices exhibit structured, non-random patterns, quantified by low choice entropy relative to a uniform baseline. We position Rushes as a diagnostic benchmark for pluralistic alignment and demonstrate a robust $\textit{Engagement Gap}$: state-of-the-art LLMs, including GPT-5, fail to outperform simple baselines. While classical Matrix Factorization (SVD) captures measurable personalized signal (37.7%), frontier LLMs (34.23%) struggle even to match the Popularity Baseline (36.4%) on event-level choice prediction. This gap suggests that single, population-level objectives, like those used in modern RLHF, are insufficient to capture heterogeneous, context-dependent engagement signals. As a result, even highly capable models default to majority preferences rather than adapting to individual trajectories. We release Rushes to support research into pluralistic alignment and sequential decision-making in generative systems.
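To make the abstract's quantities concrete, below is a minimal Python sketch of how a Rushes-style decision event might be represented and how two of the reported measures could be operationalized: normalized choice entropy relative to a uniform baseline, and a popularity-baseline accuracy. All field names (`user_id`, `candidates`, `chosen`, etc.) and the exact baseline definition are illustrative assumptions, not the released schema or the paper's evaluation code.

```python
# Sketch of a Rushes-style decision event and two abstract-level measures.
# Field names and the baseline definition are illustrative assumptions.
from collections import Counter
from dataclasses import dataclass
import math


@dataclass
class DecisionEvent:
    user_id: str           # persistent user-level identifier
    game_id: str           # one of the six games
    turn: int              # position in the time-ordered trajectory
    context: str           # evolving narrative context shown to the user
    candidates: list[str]  # full explicit candidate set at this decision point
    chosen: int            # index of the user's selected choice


def normalized_choice_entropy(events: list[DecisionEvent]) -> float:
    """Shannon entropy of the empirical choice-index distribution,
    normalized by the entropy of a uniform distribution over the largest
    candidate set. Values well below 1.0 indicate structured, non-random
    choice behavior. This pooling over choice indices is one simple
    operationalization, not necessarily the paper's exact metric."""
    counts = Counter(e.chosen for e in events)
    total = sum(counts.values())
    probs = [c / total for c in counts.values()]
    entropy = -sum(p * math.log2(p) for p in probs)
    k = max(len(e.candidates) for e in events)  # uniform-baseline support size
    return entropy / math.log2(k)


def popularity_baseline_accuracy(train: list[DecisionEvent],
                                 test: list[DecisionEvent]) -> float:
    """Always predict the globally most frequent choice index from the
    training split; report event-level accuracy on the test split."""
    top_index = Counter(e.chosen for e in train).most_common(1)[0][0]
    return sum(e.chosen == top_index for e in test) / len(test)
```

Under this framing, the Engagement Gap is the observation that a per-user model (e.g., SVD over a user-by-choice matrix) beats `popularity_baseline_accuracy`, while LLM predictions do not.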
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: corpus creation, benchmarking, NLP datasets, evaluation methodologies, human-AI interaction, human-centered evaluation, interactive storytelling
Contribution Types: Publicly available software and/or pre-trained models, Data resources, Data analysis
Languages Studied: English
Submission Number: 8745