From Forecast Scores to Auditable Benchmarks: WorldFork for LLM Forecasting Evaluation

Published: 25 May 2026, Last Modified: 25 May 2026, Venue: CTB@ICML 2026, License: CC BY 4.0
Keywords: LLM forecasting, benchmark design, uncertainty quantification, calibration, proper scoring rules, leakage controls, branching rollouts, endpoint ledgers, auditability, reproducibility
TL;DR: WorldFork reframes LLM forecasting evaluation as an auditable benchmark object that exposes leakage controls, uncertainty composition, endpoint semantics, and trace-level failures alongside proper scores.
Abstract: Foundation-model forecasting benchmarks often report aggregate scores without specifying how uncertainty, leakage, endpoint semantics, or extraction choices affect whether a result should generalize. We introduce WorldFork as a benchmark-design case study for LLM forecasting agents: a public event card is converted into branching timelines with actor state, endpoint ledgers, path mass, unresolved mass, provenance, and a scoring-rule-compatible extraction rule. The central object is therefore not only a forecast probability, but an auditable record of how uncertainty moves through decomposition, branch policy, endpoint settlement, and report generation. On 24 masked retrospective resolved-event cards, unconditional branching reduces WorldFork Brier score from 0.282 to 0.214 and log score from 0.725 to 0.581; a fixed 50/50 blend with a direct JSON forecast reaches Brier 0.205. We treat these numbers as descriptive stress-test evidence, not a guarantee: retrospective masking only partially controls leakage, the exact sign test is suggestive but not significant ($p=0.064$), the paired bootstrap interval includes zero, and multiple comparisons were explored. The contribution is a guarantee-oriented benchmark protocol that makes pre-registration, leakage audit, uncertainty composition, and trace-level failure analysis explicit for future locked evaluations.
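As a reading aid for the metrics named in the abstract, here is a minimal Python sketch of the Brier score, the log score, a fixed 50/50 blend of two probability forecasts, and a two-sided exact sign test. It is not the authors' implementation. The function names and toy probabilities are hypothetical, binary outcomes are assumed, and the log score is assumed to be the mean negative natural-log likelihood (lower is better), which the abstract does not specify.

```python
import math


def brier_score(probs, outcomes):
    """Mean squared error between forecast probabilities and 0/1 outcomes."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)


def log_score(probs, outcomes, eps=1e-12):
    """Mean negative natural-log likelihood (assumed convention; lower is better)."""
    total = 0.0
    for p, y in zip(probs, outcomes):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -math.log(p if y == 1 else 1.0 - p)
    return total / len(probs)


def blend(probs_a, probs_b, weight=0.5):
    """Fixed linear blend of two forecasts; weight=0.5 gives a 50/50 blend."""
    return [weight * a + (1.0 - weight) * b for a, b in zip(probs_a, probs_b)]


def exact_sign_test(wins, losses):
    """Two-sided exact sign test p-value on untied paired comparisons."""
    n = wins + losses
    k = max(wins, losses)
    # Upper tail of Binomial(n, 0.5), doubled for a two-sided test.
    tail = sum(math.comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)


# Toy data only: these are NOT the paper's event cards or forecasts.
branch_probs = [0.8, 0.3, 0.6, 0.1]   # hypothetical branching forecasts
direct_probs = [0.6, 0.1, 0.9, 0.3]   # hypothetical direct JSON forecasts
outcomes = [1, 0, 1, 0]               # hypothetical resolved outcomes

blended = blend(branch_probs, direct_probs)
print("Brier (branching):", brier_score(branch_probs, outcomes))
print("Brier (blend):    ", brier_score(blended, outcomes))
print("Log score (blend):", log_score(blended, outcomes))
print("Sign test p:      ", exact_sign_test(17, 7))
```

Under these conventions a constant 0.5 forecast scores a Brier of 0.25 and a log score of about 0.693, which gives a reference point for the reported values. The 17-versus-7 split in the last line is purely illustrative; the abstract does not state the per-card win count behind its reported $p=0.064$.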
Paper Type: Short (4 pages)
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 102