Keywords: human behavior simulation; simulation reliability
Abstract: Large language models (LLMs) are increasingly used to simulate human survey responses and behavioral reactions, yet the conditions under which such simulations are reliable remain unclear, making it difficult to pinpoint where errors arise and which configuration choices drive them. To make reliability analysis more interpretable and actionable, we propose the Simulation Reliability Prism (SRP), which decomposes simulation into three structured layers and analyzes how errors propagate across them along three key configuration dimensions: model capacity, profile completeness, and population coverage. The SRP jointly evaluates two complementary targets, individual-level reliability and population-level reliability. Across three survey tasks and eleven LLMs, we show that profile conditioning is necessary to avoid systematic distributional bias, whereas increasing profile completeness yields diminishing individual-level gains and transfers unreliably to population-level improvements, sometimes even reversing them. Increasing population coverage mainly reduces variance, and population-level reliability typically stabilizes with fewer than 100 samples. Our findings offer practical guidance for reliable LLM-based survey simulation.
Paper Type: Long
Research Area: Computational Social Science, Cultural Analytics, and NLP for Social Good
Research Area Keywords: Computational Social Science and Cultural Analytics
Contribution Types: Model analysis & interpretability, NLP engineering experiment
Languages Studied: English
Submission Number: 2819