Evaluating Long-Form Forecasts by Their Effect on Downstream Predictions

Jeremy Qin; Nikhil Chandak; Shashwat Goel; Hardik Bhatnagar; Ameya Prabhu; Jonas Geiping; Moritz Hardt; Maksym Andriushchenko

Published: 11 Jun 2026, Last Modified: 11 Jun 2026 · Forecast@ICML26 Poster · CC BY 4.0

Keywords: forecasting, long-form outputs, evaluation, language models

TL;DR: We evaluate frontier models' capability to generate long-form forecasts about world events by their effect on downstream predictions.

Abstract: Language model evaluations for judgmental forecasting currently test short-answer predictions to fully specified questions. However, when reasoning about the future, the precise questions that matter are often not known in advance, making real-world forecasting more open-ended. In this work, we study how to evaluate AI responses to questions like "How will AI capabilities progress by 2027?", which have no single ground truth. On the surface, this task is fraught with tradeoffs between the accuracy, importance, and quality of evidence of the forecast's claims. Nevertheless, we show that a verifiable unification is possible. We argue that the value of a long-form forecast lies in how it updates the world model of a downstream predictor. Specifically, we measure how much providing a long-form forecast improves the prediction accuracy of a weaker model on a sample of world events. We test seven frontier models with this framework and find meaningful differences in their long-form forecasts about AI progress, which, when conditioned on, lead to significant improvements in the event forecasts of the downstream predictors. We hope our methodology paves the way to measuring and improving the quality of generative forecasts used by people in their everyday decision-making.
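
The measurement described in the abstract can be illustrated with a minimal sketch. The abstract does not say which accuracy metric is used, so the Brier score here is an assumption, as are the function names `brier_score` and `forecast_uplift` and the toy numbers. The sketch compares a downstream predictor's probabilities for a set of binary events with and without the long-form forecast in its context.

```python
from statistics import mean


def brier_score(probs, outcomes):
    """Mean squared error between predicted probabilities and 0/1 outcomes.

    Lower is better. Assumed metric; the abstract does not name one.
    """
    if len(probs) != len(outcomes):
        raise ValueError("probs and outcomes must have the same length")
    return mean((p - o) ** 2 for p, o in zip(probs, outcomes))


def forecast_uplift(baseline_probs, conditioned_probs, outcomes):
    """Reduction in Brier score when the predictor sees the long-form forecast.

    Positive values mean the forecast helped the downstream predictor.
    """
    return brier_score(baseline_probs, outcomes) - brier_score(
        conditioned_probs, outcomes
    )


# Toy, made-up data: four resolved binary events.
outcomes = [1, 0, 1, 1]
# Hypothetical weaker-model probabilities without the forecast in context.
baseline = [0.5, 0.5, 0.5, 0.5]
# Hypothetical probabilities after conditioning on the long-form forecast.
conditioned = [0.75, 0.25, 0.75, 0.5]

uplift = forecast_uplift(baseline, conditioned, outcomes)
print(f"Brier uplift: {uplift:.6f}")
```

Under these assumptions, comparing `forecast_uplift` across long-form forecasts written by different models, on the same events and with the same downstream predictor, would give one number per forecast to rank them by.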

Submission Number: 66
