OpenEstimate: Evaluating LLMs on Probabilistic Estimation with Real-World Data

ICLR 2026 Conference Submission 21495 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: probabilistic estimation, reasoning, uncertainty, calibration
TL;DR: Language models (LMs) excel at reasoning on tasks with clear answers and complete information, yet many real-world applications are open-ended and uncertain, requiring reasoning about incomplete or noisy data. OpenEstimate evaluates how well LMs express such beliefs as Bayesian priors over real-world quantities.
Abstract: Decisions in the real world rely on noisy, limited data. Language models (LMs), with broad pretrained knowledge, can help decision-makers by offering informed Bayesian priors that guide better choices. However, the extent to which LMs can provide reliable priors remains poorly understood. We introduce OpenEstimate, a benchmark that asks LMs to express beliefs as Bayesian priors over real-world quantities from labor economics, private markets, and public health. We assess these priors for both accuracy and calibration, benchmarking them against statistical baselines built by sampling from the true distribution. Across six frontier LMs, LM-elicited priors are often inaccurate and overconfident: they seldom beat posteriors formed from five real observations. Performance improves modestly depending on how uncertainty is elicited from the model, but is largely unaffected by changes in temperature, reasoning effort, or system prompt. Given LMs’ weak performance, OpenEstimate offers an important foundation for building systems that can reason under uncertainty and know when to doubt themselves.
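To make the evaluation setup concrete, the sketch below illustrates one plausible instantiation of the comparison described in the abstract: an LM-elicited prior (here assumed to be parsed into a Normal distribution) is scored against a statistical baseline posterior formed from five real observations. The specific distributions, the conjugate Normal-Normal update, the log-density scoring rule, and all numerical values are illustrative assumptions, not the paper's actual protocol.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical quantity of interest and observation noise (illustrative values).
true_value = 42.0
obs_sd = 10.0

# An LM-elicited prior, e.g. parsed from the model's stated mean and uncertainty.
lm_prior = stats.norm(loc=55.0, scale=5.0)

# Statistical baseline: a weak prior updated on five real observations
# (standard Normal-Normal conjugate update with known observation variance).
obs = rng.normal(true_value, obs_sd, size=5)
prior_mean, prior_var = 0.0, 100.0**2
post_var = 1.0 / (1.0 / prior_var + len(obs) / obs_sd**2)
post_mean = post_var * (prior_mean / prior_var + obs.sum() / obs_sd**2)
baseline = stats.norm(loc=post_mean, scale=np.sqrt(post_var))

# Score each distribution by the log density it assigns to the true value
# (higher is better); an off-target, overconfident prior scores poorly.
for name, dist in [("LM prior", lm_prior), ("5-obs posterior", baseline)]:
    print(f"{name}: log-density at truth = {dist.logpdf(true_value):.2f}")
```

Under this kind of scoring, a prior that is both inaccurate and overconfident is penalized heavily, which is one way the accuracy-and-calibration comparison in the abstract could be operationalized.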
Primary Area: datasets and benchmarks
Submission Number: 21495