Keywords: LLM forecasting, probabilistic reasoning, calibration, prediction markets, failure modes, model evaluation
Abstract: Large Language Models (LLMs) demonstrate partial forecasting competence across social, political, and economic events, yet their predictive ability varies sharply with domain structure and prompt framing. We investigate how forecasting performance differs across model families on real-world questions about events occurring after each model's training cutoff. We analyze how context, question type, and external knowledge affect accuracy and calibration, and how adding factual news context alters belief formation and failure modes. Our results show that forecasting ability is highly variable: it depends on both what we ask and how we ask it.
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: corpus creation; benchmarking; NLP datasets; evaluation methodologies; evaluation; metrics; reproducibility; automatic evaluation of datasets
Contribution Types: Model analysis & interpretability, Data resources, Data analysis
Languages Studied: English
Submission Number: 8235