Evaluating LLMs on Real-World Forecasting Against Expert Forecasters

19 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: automated forecasting, benchmarks
TL;DR: LLM forecasting rivals crowd forecasting accuracy but still falls short of expert forecasters
Abstract: Large language models (LLMs) have demonstrated remarkable capabilities across diverse tasks, but their ability to forecast future events remains understudied. Earlier LLMs struggled to approach the accuracy of a human forecasting crowd. I evaluate state-of-the-art LLMs on 464 forecasting questions from Metaculus, comparing their performance against top human forecasters. Frontier models achieve Brier scores that ostensibly surpass the human crowd but still significantly underperform a group of experts.
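As context for the evaluation metric (a generic illustration, not the paper's code): the Brier score is the mean squared error between probabilistic forecasts and binary outcomes, so lower is better and a constant 0.5 forecast scores 0.25.

```python
def brier_score(probs, outcomes):
    """Mean squared error between forecast probabilities and 0/1 outcomes.

    probs: forecast probabilities in [0, 1]; outcomes: realized 0/1 results.
    Lower is better; a constant 0.5 forecast scores exactly 0.25.
    """
    assert len(probs) == len(outcomes)
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)

# Illustrative forecasts: confident-and-right beats hedged-and-right.
print(brier_score([0.9, 0.2, 0.7], [1, 0, 1]))  # (0.01 + 0.04 + 0.09) / 3
```

A forecaster is compared to a crowd or to experts by computing each party's Brier score over the same question set.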
Primary Area: datasets and benchmarks
Submission Number: 20819