SimBench: Benchmarking the Ability of Large Language Models to Simulate Human Behaviors

Published: 22 Jun 2025, Last Modified: 22 Jun 2025 · ACL-SRW 2025 Oral · CC BY 4.0
Keywords: human behavior simulation, large language models, benchmarking, computational social science, human-AI alignment, calibration, human-centered AI
Abstract: Simulations of human behavior based on large language models (LLMs) have the potential to revolutionize the social and behavioral sciences, if and only if they faithfully reflect real human behaviors. Prior work across many disciplines has evaluated the simulation capabilities of specific LLMs in specific experimental settings, but has often produced disparate results. To move towards a more robust understanding, we introduce SimBench, the first large-scale benchmark to evaluate how well LLMs can simulate group-level human behaviors across diverse settings and tasks. SimBench compiles 20 datasets in a unified format, measuring diverse types of behavior (e.g., decision-making vs. self-assessment) across hundreds of thousands of participants from different parts of the world. Using SimBench, we can ask fundamental questions regarding when, how, and why LLM simulations succeed or fail. For example, we show that, while even the best LLMs today have limited simulation ability, there is a clear log-linear scaling relationship with model size, and a strong correlation between simulation and scientific reasoning abilities. We also show that base LLMs, on average, are better at simulating high-entropy response distributions, while the opposite holds for instruction-tuned LLMs. By making progress measurable, we hope that SimBench can accelerate the development of better LLM simulators in the future.
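To make the notion of "simulating group-level response distributions" concrete, the sketch below compares a hypothetical LLM-simulated answer distribution against an observed human distribution on a single multiple-choice item, reporting the entropy of each distribution and their total variation distance. This is an illustrative assumption, not SimBench's actual scoring metric; all option shares and numbers are made up.

```python
# Illustrative sketch only (not the paper's metric): compare an LLM's simulated
# group-level answer distribution to the human distribution on one item.
import numpy as np

def entropy_bits(p: np.ndarray) -> float:
    """Shannon entropy (in bits) of a discrete distribution."""
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

def total_variation(p: np.ndarray, q: np.ndarray) -> float:
    """Total variation distance between two distributions over the same options."""
    return float(0.5 * np.sum(np.abs(p - q)))

# Hypothetical human and LLM-simulated response shares over options A-D.
human = np.array([0.10, 0.25, 0.45, 0.20])
simulated = np.array([0.05, 0.15, 0.65, 0.15])

print(f"human entropy:     {entropy_bits(human):.3f} bits")
print(f"simulated entropy: {entropy_bits(simulated):.3f} bits")
print(f"TV distance:       {total_variation(human, simulated):.3f}")
```

In this framing, a "high-entropy response distribution" is one where human answers are spread across many options, and a better simulator is one whose predicted distribution lies closer to the human one under some divergence or distance measure.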
Archival Status: Non‑archival
Paper Length: Long Paper (up to 8 pages of content)
Submission Number: 195