Position: Multi-Agent LLM Simulation as Approximate Posterior Inference Demands a Probabilistic Calibration Standard

Published: 30 May 2026, Last Modified: 01 Jun 2026
Venue: SPIGM @ ICML Poster
License: CC BY 4.0
Keywords: multi-agent simulation, large language models, posterior inference, simulation-based calibration, Bayesian workflow, structured probabilistic inference, position paper
TL;DR: LLM-driven multi-agent simulations are approximate posterior samplers. The SBC chi-square statistic detects miscalibration: 9.9 under correct calibration (passes) versus 1571 under temperature miscalibration at T=2 (fails by a wide margin). We propose a four-item Bayesian calibration standard.
Abstract: Multi-agent simulations driven by large language models (LLMs) are increasingly used to study emergent collective behavior. The standard reporting convention treats the empirical distribution of agent strategies as a direct measurement. We argue that this convention obscures a probabilistic structure: an LLM-driven multi-agent simulation is best understood as an approximate sampling procedure that produces draws from p(theta | E, c), the posterior over latent agent-population parameters theta conditional on environment E and prompt configuration c. The standards of probabilistic inference therefore apply: posterior contraction, prior sensitivity, simulation-based calibration (SBC), and identifiability. We support the position with three micro-experiments on a stylized 25-agent population: (1) posterior contraction with the number of rounds R (total variation (TV) distance shrinks from 0.21 at R=1 to 0.008 at R=100); (2) temperature-induced posterior bias (TV distance to the true mixture rises monotonically from 0.008 at temperature T=1 to 0.057 at T=3, even as the empirical distribution converges precisely to the wrong target); and (3) SBC rank-statistic diagnostics that pass under correct calibration (chi-square in [5.2, 9.9], well below the critical value 16.92) and fail catastrophically under temperature miscalibration (chi-square in [1571, 1727], roughly two orders of magnitude beyond the critical value). We propose a four-item probabilistic calibration standard for multi-agent LLM simulation papers: posterior summary disclosure, prior sensitivity analysis, simulation-based calibration check, and identifiability assessment.
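The SBC rank-statistic check described in the abstract can be illustrated with a minimal sketch. It does not reproduce the paper's 25-agent experiments: it assumes a toy Beta-Binomial model with a Beta(1, 1) prior, and it imitates temperature miscalibration by tempering the likelihood with exponent 1/T, which is a stand-in chosen here and not the authors' procedure. The function names and parameters (`sbc_chi_square`, `n_sims`, `n_draws`, `n_obs`) are hypothetical. Ranks are grouped into 10 bins, so the statistic has 9 degrees of freedom, for which 16.92 is the 0.05-level chi-square critical value; the abstract quotes that value, but the bin count is an assumption.

```python
import random

N_BINS = 10       # assumed: 10 rank bins, i.e. 9 degrees of freedom
CRITICAL = 16.92  # critical value quoted in the abstract


def chi_square_uniform(counts):
    """Pearson chi-square statistic of bin counts against a uniform expectation."""
    expected = sum(counts) / len(counts)
    return sum((c - expected) ** 2 / expected for c in counts)


def sbc_chi_square(temperature, n_sims=2000, n_draws=99, n_obs=50, seed=0):
    """Run SBC on a toy Beta-Binomial model and return the rank chi-square.

    temperature == 1 samples the exact posterior; temperature > 1 samples an
    overdispersed (tempered) posterior, a stand-in for a miscalibrated sampler.
    """
    rng = random.Random(seed)
    counts = [0] * N_BINS
    for _ in range(n_sims):
        # 1. Draw a ground-truth parameter from the Beta(1, 1) prior.
        theta_true = rng.betavariate(1.0, 1.0)
        # 2. Simulate n_obs Bernoulli observations given that parameter.
        successes = sum(rng.random() < theta_true for _ in range(n_obs))
        # 3. Draw from the (possibly tempered) conjugate Beta posterior.
        a = 1.0 + successes / temperature
        b = 1.0 + (n_obs - successes) / temperature
        # 4. Rank of the truth among the posterior draws, in 0..n_draws.
        rank = sum(rng.betavariate(a, b) < theta_true for _ in range(n_draws))
        # 5. Map the rank to one of N_BINS equal-width bins.
        counts[rank * N_BINS // (n_draws + 1)] += 1
    return chi_square_uniform(counts)


if __name__ == "__main__":
    for t in (1.0, 2.0):
        stat = sbc_chi_square(t)
        verdict = "pass" if stat < CRITICAL else "fail"
        print(f"T={t}: chi-sq={stat:.1f} ({verdict})")
```

Under the exact posterior the ranks are uniform, so the statistic behaves like a chi-square variable with 9 degrees of freedom and exceeds 16.92 only about 5% of the time. Under the tempered posterior the ranks pile up in the middle bins and the statistic grows with `n_sims`. The magnitudes in this toy model are not expected to match the values reported in the abstract.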
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 256