LLMs as Implicit Probabilistic World Models: Distributional Stability, Convergence, and Scale Discordance

ACL ARR 2026 January Submission7855 Authors

06 Jan 2026 (modified: 20 Mar 2026) · ACL ARR 2026 January Submission · CC BY 4.0
Keywords: Large Language Models, LLM Evaluation, Distributional Evaluation, Uncertainty Analysis, Model Scaling, Convergence Dynamics, Prompt Sensitivity, Synthetic Data Generation, Decision Simulation
Abstract: Large Language Models (LLMs) are increasingly used to generate structured predictions and simulate human decision-making, yet the distributional behavior of their outputs remains underexplored. We treat LLMs as implicit probabilistic world models and introduce a systematic framework for evaluating their output distributions under repeated sampling, prompt perturbations, and model scaling. Across models and domains, we observe a consistent \emph{modes-first, tails-later} dynamics: high-probability outcomes stabilize within a few iterations, while low-probability alternatives continue to evolve, with post-convergence variability dominated by the distribution tail rather than shifts in dominant predictions. We further find pronounced \emph{scale discordance}, with low agreement between models of different sizes, suggesting that scaling alters underlying probabilistic representations rather than merely refining them. Prompt-based perturbations, including persona conditioning and seasonal context, induce systematic but bounded distributional shifts that are smaller in magnitude than differences across model scales. Experiments on air travel destination choice, with cross-domain verification in restaurant and product selection, demonstrate the robustness of these behaviors. Together, our results provide a distribution-level perspective on LLM stability and sensitivity, with implications for probabilistic simulation and decision-support applications.
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: evaluation methodologies, metrics, statistical testing for evaluation, reproducibility
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Data analysis
Languages Studied: English
Submission Number: 7855