Evaluating LLMs on Real-World Forecasting Against Expert Forecasters

19 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: automated forecasting, benchmarks
TL;DR: LLM forecasting rivals crowd forecasting accuracy but still falls short of expert forecasters
Abstract: Large language models (LLMs) have demonstrated remarkable capabilities across diverse tasks, but their ability to forecast future events remains understudied. Earlier LLMs struggled to approach the accuracy of a human forecasting crowd. I evaluate state-of-the-art LLMs on 464 forecasting questions from Metaculus, comparing their performance against top human forecasters. Frontier models achieve Brier scores that ostensibly surpass the human crowd but still significantly underperform a group of experts.
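As context for the evaluation metric (a generic illustration, not the paper's code): the Brier score is the mean squared error between probabilistic forecasts and binary outcomes, so lower is better and a constant 0.5 forecast scores 0.25.

```python
def brier_score(probs, outcomes):
    """Mean squared error between forecast probabilities and 0/1 outcomes.

    probs: forecast probabilities in [0, 1]; outcomes: realized 0/1 results.
    Lower is better; a constant 0.5 forecast scores exactly 0.25.
    """
    assert len(probs) == len(outcomes)
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)

# Illustrative forecasts: confident-and-right beats hedged-and-right.
print(brier_score([0.9, 0.2, 0.7], [1, 0, 1]))  # (0.01 + 0.04 + 0.09) / 3
```

A forecaster is compared to a crowd or to experts by computing each party's Brier score over the same question set.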
Primary Area: datasets and benchmarks
Submission Number: 20819