Keywords: Large Language Models, Probabilistic Reasoning
Abstract: Artificial intelligence (AI) systems hold great promise for advancing various scientific disciplines and are increasingly used in real-world applications. Despite this remarkable progress, further capabilities are needed to achieve more general forms of intelligence. A critical distinction in this context is between factual knowledge, which can be evaluated against true or false answers (e.g., "what is the capital of England?"), and probabilistic knowledge, which reflects probabilistic properties of the real world (e.g., "what is the sex of a computer science graduate in the US?").
Much of the previous work on evaluating large language models (LLMs) focuses on factual knowledge. In this paper, by contrast, our goal is to build a benchmark for assessing the capabilities of LLMs in terms of their knowledge of probability distributions describing the real world.
Given that LLMs are trained on vast amounts of text, it is plausible that they internalize aspects of these distributions.
Indeed, this idea has gained traction, with LLMs being touted as powerful and universal approximators of real-world distributions. At the same time, classical results in statistics, collectively known as the curse of dimensionality, highlight fundamental challenges in learning distributions in high dimensions, casting doubt on the notion of universal distributional learning. In this work,
we develop the first benchmark to directly test this hypothesis, evaluating whether LLMs have access to empirical distributions describing real-world populations across domains such as economics, health, education, and social behavior. Our results demonstrate that LLMs perform poorly overall, and do not seem to internalize real-world statistics naturally.
This finding also has important implications when interpreted in the context of Pearl’s Causal Hierarchy (PCH). Our benchmark demonstrates that language models do not contain knowledge of observational distributions (Layer 1 of the PCH), and thus the Causal Hierarchy Theorem implies that their interventional (Layer 2) and counterfactual (Layer 3) knowledge is also limited.
Primary Area: datasets and benchmarks
Submission Number: 24965