Keywords: forecasting, long-form outputs, evaluation, language models
TL;DR: We evaluate frontier models' capability to generate long-form forecasts about world events by measuring their effect on downstream predictions.
Abstract: Language model evaluations for judgmental forecasting currently test short-answer predictions to fully specified questions. However, when reasoning about the future, the precise questions that matter are often not known in advance, making real-world forecasting more open-ended. In this work, we study how to evaluate AI responses to questions like ``How will AI capabilities progress by 2027?'', which have no single ground truth. On the surface, this task is mired in tradeoffs between the accuracy, importance, and quality of evidence of the forecast's claims. Nevertheless, we show that a verifiable unification is possible. We argue that the value of a long-form forecast lies in how it updates the world model of a downstream predictor. Specifically, we measure how much providing a long-form forecast improves the prediction accuracy of a weaker model on a sample of world events. We test seven frontier models with this framework and find meaningful differences in their long-form forecasts about AI progress. Conditioning the downstream predictors on these forecasts leads to significant improvements in their event forecasts. We hope our methodology paves the way to measuring and improving the quality of generative forecasts used by people in their everyday decision-making.
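The measurement described in the abstract can be illustrated with a minimal sketch. It assumes binary world events and uses the Brier score as the accuracy metric; the abstract does not say which metric is used, so this choice, along with the function names `brier_score` and `forecast_uplift` and the toy numbers, is hypothetical and only for illustration.

```python
from statistics import mean


def brier_score(probs, outcomes):
    """Mean squared error between predicted probabilities and 0/1 outcomes (lower is better)."""
    if len(probs) != len(outcomes):
        raise ValueError("probs and outcomes must have the same length")
    return mean((p - o) ** 2 for p, o in zip(probs, outcomes))


def forecast_uplift(baseline_probs, conditioned_probs, outcomes):
    """Reduction in Brier score when the downstream predictor sees the long-form forecast.

    Positive values mean the long-form forecast improved the predictor's accuracy.
    Assumption: this is one plausible way to score the improvement, not the paper's stated metric.
    """
    return brier_score(baseline_probs, outcomes) - brier_score(conditioned_probs, outcomes)


# Toy data (made up): four resolved events and a weaker model's probabilities
# without and with the long-form forecast in its context.
outcomes = [1, 0, 1, 1]
baseline = [0.5, 0.5, 0.5, 0.5]
conditioned = [0.8, 0.3, 0.7, 0.6]

uplift = forecast_uplift(baseline, conditioned, outcomes)
print(f"Brier uplift from conditioning: {uplift:.3f}")
```

Under this sketch, long-form forecasts from different models would be compared by the uplift each one produces on the same sample of events and the same downstream predictor.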
Submission Number: 66