Position: LLM Social-Simulation Agents in the Wild Cannot Serve as Social Scientific Evidence Without an Identification Strategy
Keywords: LLM agents, social simulation, agent reliability, identification strategy, agent safety in the wild, agentic deployment, position paper
TL;DR: LLM social-simulation agents pass fidelity checks but fail as evidence; we propose an identification-disclosure standard.
Abstract: This position paper argues that LLM social simulations cannot substitute for human-subject evidence without an identification strategy. Recent work shows high predictive fidelity: GPT-4 simulations correlate strongly with human treatment effects across hundreds of contrasts, including a post-training-cutoff subset. Critical work, meanwhile, shows that synthetic respondents fail regression, prompt-sensitivity, and temporal-stability tests. These findings are not contradictory: prediction asks whether outputs match observed outcomes, while substitution asks whether the simulation identifies the social data-generating process. We formalize the problem as observational equivalence among three mechanisms: training-prior retrieval, prompt-induced role compliance, and genuine interactional emergence. The same outcome distribution can be rationalized by multiple combinations of these mechanisms, so predictive fit alone cannot identify emergence. We audit the principal LLM social-simulation literature through this lens, concede the strongest predictive-fidelity result, and show why it cannot license replacing respondents or experiments. NeurIPS should require an identification standard: simulations may generate hypotheses or forecasts, but they become evidence only when their identifying assumptions are explicit, testable, and stress-tested. Positioned for AIWILD, the paper treats LLM social-simulation pipelines as agentic AI deployed in the wild of social-scientific decision-making, where reliability and trustworthiness require not only predictive fit but also identification of the agentic data-generating process. Without such identification, agentic outputs masquerade as human-subject evidence and propagate into policy decisions through a failure mode that is invisible to standard agent benchmarks.
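The observational-equivalence claim in the abstract can be illustrated with a toy example. The sketch below is not the paper's formalization: it assumes, purely for illustration, that each of the three mechanisms induces its own probability of a binary outcome and that the simulation's output is a weighted mixture of them. The probabilities, the weights, and the names MECHANISM_P, outcome_probability, and observationally_equivalent are all hypothetical.

```python
import math

# Hypothetical per-mechanism probabilities of a binary outcome (e.g. "agree").
# Illustrative assumptions only, not estimates from any study.
MECHANISM_P = {
    "training_prior_retrieval": 0.2,
    "prompt_role_compliance": 0.8,
    "interactional_emergence": 0.5,
}


def outcome_probability(weights):
    """Mixture probability of the outcome for mechanism weights that sum to 1."""
    if not math.isclose(sum(weights.values()), 1.0):
        raise ValueError("mechanism weights must sum to 1")
    return sum(weights[name] * MECHANISM_P[name] for name in MECHANISM_P)


def observationally_equivalent(a, b, tol=1e-12):
    """True when two weightings imply the same observable outcome probability."""
    return math.isclose(outcome_probability(a), outcome_probability(b), abs_tol=tol)


# Three very different mechanistic stories (assumed weights).
NO_EMERGENCE = {
    "training_prior_retrieval": 0.5,
    "prompt_role_compliance": 0.5,
    "interactional_emergence": 0.0,
}
ONLY_EMERGENCE = {
    "training_prior_retrieval": 0.0,
    "prompt_role_compliance": 0.0,
    "interactional_emergence": 1.0,
}
MIXED = {
    "training_prior_retrieval": 0.25,
    "prompt_role_compliance": 0.25,
    "interactional_emergence": 0.5,
}

if __name__ == "__main__":
    for label, w in [("no emergence", NO_EMERGENCE),
                     ("only emergence", ONLY_EMERGENCE),
                     ("mixed", MIXED)]:
        print(label, outcome_probability(w))
```

Under these assumed numbers, all three weightings imply the same outcome probability, even though the share attributed to emergence ranges from 0 to 1. A perfect predictive fit to that outcome therefore cannot distinguish among them without further identifying assumptions.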
Track: Regular Paper (9 pages)
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 330