Position: Time to Close The Validation Gap in LLM Social Simulations

Published: 30 Apr 2026, Last Modified: 24 Jun 2026
Venue: ICML 2026 Position Paper Track (regular)
Readers: Everyone
License: CC BY-SA 4.0
TL;DR: LLM-based social simulation research must pivot from expansion to consolidation—adopting shared benchmarks, open data, and reproducible evaluation—before these tools can responsibly inform high-stakes decisions.
Abstract: LLM-based social simulations, in which many language model agents are placed in social situations and interact over multiple turns, are rapidly proliferating across policy analysis, epidemiology, and computational social science. Yet the field lacks consensus on how to validate these simulations: evaluation methods are few, underdeveloped, fragmented, and rarely shared across disciplines. We argue this creates a serious risk: premature deployment of unvalidated simulators in high-stakes domains. Our position is that the field must pivot from expansion to consolidation, prioritizing methodological standardization through shared benchmarks, open data, and reproducible evaluation protocols grounded in social science and complex systems research. We outline a concrete research program organized around specific learning problems and benchmarks, providing a path toward answering the fundamental question: when are LLM social simulations useful modelling objects?
Lay Summary: Researchers are increasingly using AI language models to simulate how groups of people might behave, for instance to predict how a community responds to a new policy or how a disease spreads. These simulations are already being applied in high-stakes areas like public health and policymaking. The problem is that there is currently no agreed-upon way to check whether such simulations are accurate, and the methods that do exist are scattered and rarely shared between fields. We argue that the field is expanding faster than it is building the tools to verify these systems, which risks important decisions resting on simulators no one has confirmed actually work. Our position is that researchers should slow this expansion and instead agree on shared standards: common tests, openly available data, and repeatable evaluation methods grounded in established social science. To support this, we lay out a concrete research agenda built around specific, testable problems. The goal is to answer a basic question the field has skipped past: when can these AI social simulations be trusted as reliable models of real human behavior?
Link To Code: https://github.com/sandbox-social/silisocs
Primary Area: Research Priorities, Methodology, and Evaluation
Keywords: Large Language Models, social simulations, evaluation, reproducibility
Flagged For Ethics Review: true
Submission Number: 642