Outcome-Free Audits and Repairs for LLM Forecasters

Juliana Li; Diya Sreedhar

Published: 11 Jun 2026, Last Modified: 11 Jun 2026. Venue: Forecast@ICML26 Poster. License: CC BY 4.0

Keywords: LLM forecasting, forecaster evaluation, Dutch-book coherence, outcome-free auditing, probabilistic forecasting, prediction markets, Brier score, CoherenceBench

TL;DR: Outcome-free Dutch-book audits check LLM forecasters on four axes and detect probability-axiom violations without ground truth. None of the 15 models is violation-free on all four axes; coherent projection recovers 21% of pair-Brier loss without additional model calls. CoherenceBench v0 released.

Abstract: LLM forecasters now post Brier scores rivaling crowd aggregates on ForecastBench and Prophet Arena, yet their probabilities can be exploited by a Dutch book that requires no resolved outcomes. The questions that matter most resolve years after deployment, so proper-scoring-rule loss is unavailable when the forecasts are actually used. We introduce a deployment-time audit stack: given a forecast vector, it detects violations of probability coherence and reports four classes of Dutch-book exposure (complementary pairs, monotonicity chains, Fréchet conjunction bounds, logical entailment) without waiting for resolution. For complementary pairs the symmetric coherent projection has an exact Brier-improvement identity (Theorem 1), so violations on this axis can be repaired as well as detected. We release CoherenceBench v0 ($262$ questions, $68$ pairs, $19$ chains, $39$ conjunction triples, $21$ entailments) with the audit harness. Coherence audits complement rather than replace proper-scoring-rule evaluation: a forecaster can be perfectly coherent and uninformative ($f{\equiv}0.5$) or accurate and incoherent. Across 15 contemporary forecasters (9 open-weight, 6 closed), no observed forecaster is violation-free on all four axes (a zero count on a small-$n$ axis gives only an upper bound on the violation rate and is not conclusive), and the model with the best aggregate coherence profile (R1-Distill-Qwen-32B-AWQ) has $\sim 15{\times}$ the mean Brier of GPT-5 ($0.27$ vs $0.018$). Symmetric coherent projection recovers $21.09\%$ of pair-Brier loss without additional model calls; per-pair improvement equals $g^2/4$ for $g{=}f(A){+}f(B){-}1$ (identity verified to numerical precision on $991$ labeled complementary observations).
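The four audit classes and the pair repair described in the abstract can be illustrated with a minimal sketch. It is not the released audit harness: all function names are hypothetical, chains are assumed to be ordered so that probabilities should be non-increasing, and pair-Brier is assumed to be the mean of the two squared errors on a complementary pair. Under that assumption, shifting both forecasts by $g/2$ lowers pair-Brier by exactly $g^2/4$ whichever way the question resolves, which the last lines check numerically.

```python
from typing import Sequence, Tuple

TOL = 1e-9  # numerical slack; an assumption, not a value from the paper


def pair_gap(f_a: float, f_b: float) -> float:
    """g = f(A) + f(B) - 1 for a complementary pair; g != 0 is a violation."""
    return f_a + f_b - 1.0


def chain_violations(probs: Sequence[float]) -> int:
    """Count adjacent inversions in a chain assumed to be non-increasing."""
    return sum(1 for p, q in zip(probs, probs[1:]) if q > p + TOL)


def frechet_violation(f_a: float, f_b: float, f_ab: float) -> bool:
    """True if f(A and B) lies outside [max(0, fA + fB - 1), min(fA, fB)]."""
    lower = max(0.0, f_a + f_b - 1.0)
    upper = min(f_a, f_b)
    return f_ab < lower - TOL or f_ab > upper + TOL


def entailment_violation(f_a: float, f_b: float) -> bool:
    """If A entails B, coherence requires f(A) <= f(B)."""
    return f_a > f_b + TOL


def project_pair(f_a: float, f_b: float) -> Tuple[float, float]:
    """Symmetric projection onto f(A) + f(B) = 1: subtract g/2 from each."""
    half = pair_gap(f_a, f_b) / 2.0
    return f_a - half, f_b - half


def pair_brier(f_a: float, f_b: float, y_a: int) -> float:
    """Mean squared error over the pair; B resolves as the complement of A."""
    return ((f_a - y_a) ** 2 + (f_b - (1 - y_a)) ** 2) / 2.0


# Check the g^2/4 identity for both possible resolutions of A.
f_a, f_b = 0.7, 0.5
g = pair_gap(f_a, f_b)
p_a, p_b = project_pair(f_a, f_b)
for y_a in (0, 1):
    gain = pair_brier(f_a, f_b, y_a) - pair_brier(p_a, p_b, y_a)
    assert abs(gain - g * g / 4.0) < 1e-12
```

None of the four checks, nor the projection, takes a resolved outcome as input; `pair_brier` needs a label only to verify the identity after the fact. For forecasts in $[0,1]$ the projected values $(f(A)-f(B)+1)/2$ and $(f(B)-f(A)+1)/2$ also stay in $[0,1]$, so no clipping is needed.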

Submission Number: 159
