Confirmation: I have read and agree with the workshop's policy on behalf of myself and my co-authors.
Keywords: LLM forecasting, probabilistic reasoning, calibration, prediction markets, failure modes, model evaluation
TL;DR: We test large language models on real-world forecasting questions and find that accuracy and calibration vary by domain. Adding news context can help, but it also introduces failure modes such as recency bias, rumour anchoring, and definition drift.
Abstract: Large Language Models (LLMs) demonstrate partial forecasting competence across social, political, and economic events, yet their predictive ability varies sharply with domain structure and prompt framing. We investigate how forecasting performance differs across model families on real-world questions about events that resolved after each model's training cutoff. We analyze how context, question type, and external knowledge affect accuracy and calibration, and how adding factual news context modifies belief formation and failure modes. Our results show that forecasting ability is highly variable: it depends on what we ask, and how we ask it.
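For readers unfamiliar with the metrics the abstract refers to, below is a minimal sketch of two standard scores for binary forecasting questions, the Brier score and expected calibration error (ECE). This is a generic illustration of how such forecasts are commonly evaluated, not the paper's actual implementation; the function names and binning scheme are illustrative choices.

```python
import numpy as np

def brier_score(probs, outcomes):
    """Mean squared error between forecast probabilities and binary outcomes (lower is better)."""
    probs = np.asarray(probs, dtype=float)
    outcomes = np.asarray(outcomes, dtype=float)
    return float(np.mean((probs - outcomes) ** 2))

def calibration_error(probs, outcomes, n_bins=10):
    """Expected calibration error: bin forecasts by stated probability and compare
    each bin's mean forecast to its empirical outcome frequency, weighted by bin size."""
    probs = np.asarray(probs, dtype=float)
    outcomes = np.asarray(outcomes, dtype=float)
    bins = np.clip((probs * n_bins).astype(int), 0, n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            ece += mask.mean() * abs(probs[mask].mean() - outcomes[mask].mean())
    return ece

# Example: three probabilistic forecasts scored against resolved outcomes.
p = [0.9, 0.2, 0.7]
y = [1, 0, 0]
print(brier_score(p, y))        # 0.18; a constant 0.5 "guessing" forecast scores 0.25
print(calibration_error(p, y))  # ~0.33; 0 would mean perfectly calibrated
```

A well-calibrated forecaster can still have a poor Brier score (and vice versa), which is why evaluations like the one described here typically report both.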
Supplementary Material: pdf
Submission Track: Workshop Paper Track
Submission Number: 36