Keywords: LLM Evaluation, LLM Benchmark, Social Reasoning, dataset, opinion prediction
Abstract: Large Language Models (LLMs) are increasingly used to predict public opinion, but they are typically evaluated on structured surveys, which strip away the rich social, cultural, and temporal context of real-world discourse. This misalignment creates a critical evaluation gap. To address this, we introduce MindVote, the first benchmark for public opinion prediction grounded in authentic social media. MindVote consists of 3,918 naturalistic polls from Reddit and Weibo, spanning 23 topics and enriched with detailed contextual metadata. Our evaluation of 15 LLMs on MindVote reveals that general-purpose models outperform models fine-tuned on survey data, highlighting the importance of in-context reasoning. MindVote provides a robust evaluation framework for developing more socially intelligent AI.
Submission Number: 13