Keywords: forecasting, benchmarks, surrogate scoring rules, elicitation without verification, peer prediction, scoring rules
TL;DR: We evaluate LLM forecasters by comparing their predictions against one another rather than waiting for outcomes, enabling immediate scores that closely track traditional forecasting metrics.
Abstract: Large language models are increasingly used as general-purpose forecasters, but benchmarking their forecasting ability remains slow and fragile. Retrospective benchmarks risk training-data contamination, while prospective benchmarks require waiting weeks or months for questions to resolve. We propose an instantaneous evaluation method based on proxy scoring rules. Rather than scoring each forecast against the eventual outcome, we score it against an extremized aggregate of the forecasts made by other models, drawing on the literature on information elicitation without verification. Empirically, on existing LLM forecasting data, these proxy scores correlate strongly with resolved-outcome metrics such as the Brier score. They are almost as predictive of future performance as Brier scores are, and substantially less noisy over time. We further show that the method's effectiveness depends crucially on the aggregation method: simple means and medians can perform poorly, while logit-mean aggregation followed by extremization yields consistently strong correlations.
Submission Number: 62
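The proxy scoring described in the abstract can be illustrated with a minimal Python sketch for a single binary question. It is not the authors' implementation. The function names, the extremization factor `alpha = 2.0`, the clipping constant `eps`, and the choice of squared distance to the proxy as the score are all assumptions made for illustration; the abstract only specifies logit-mean aggregation of the other models' forecasts followed by extremization.

```python
import math


def logit(p, eps=1e-6):
    # Clip to avoid infinities at 0 or 1 (eps is an assumed constant).
    p = min(max(p, eps), 1 - eps)
    return math.log(p / (1 - p))


def sigmoid(x):
    return 1 / (1 + math.exp(-x))


def extremized_logit_mean(probs, alpha=2.0):
    # Average the forecasts in log-odds space, then scale the mean log-odds
    # by alpha > 1 to push the aggregate away from 0.5 (extremization).
    # alpha = 2.0 is an illustrative value, not one taken from the paper.
    mean_logit = sum(logit(p) for p in probs) / len(probs)
    return sigmoid(alpha * mean_logit)


def proxy_brier_scores(forecasts, alpha=2.0):
    # forecasts: dict mapping model name -> probability for one binary question.
    # Each model is scored against the aggregate of the OTHER models only,
    # so its own forecast cannot influence its proxy target.
    scores = {}
    for name, p in forecasts.items():
        others = [q for other, q in forecasts.items() if other != name]
        proxy = extremized_logit_mean(others, alpha)
        # Brier-style squared error against the proxy (lower is better).
        scores[name] = (p - proxy) ** 2
    return scores


# Hypothetical forecasts from three models on one question.
example = {"model_a": 0.9, "model_b": 0.8, "model_c": 0.2}
example_scores = proxy_brier_scores(example)
```

In this toy example, `model_c` disagrees with the other two and receives the largest (worst) proxy score. In practice, scores would be averaged over many questions before comparing models.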