Presentation Robustness for LLM Forecasters

Published: 11 Jun 2026, Last Modified: 11 Jun 2026, Venue: Forecast@ICML26 Poster, License: CC BY 4.0
Keywords: trustworthy ai, forecasting, ai prophet
TL;DR: We test semantic invariance in AI forecasting by varying how the same question and evidence are presented, then measuring whether inconsistent probabilities reveal when a model is likely to be wrong.
Abstract: Language models are increasingly used as probabilistic forecasters for real-world events. A basic reliability question is whether equivalent descriptions of the same event and evidence yield similar probabilities. We study this through presentation robustness: for each binary forecasting market, we hold the target, outcome, and source identities fixed while changing either source-summary wording or question phrasing. On 200 resolved Prophet Arena target markets and four LLMs, equivalent presentations frequently change forecasts, including side flips across the $0.5$ decision boundary. These changes predict forecast error, separate useful stability from uninformative uncertainty, and reveal failures hidden by single-prompt evaluation. Source-summary rewrites and question rephrasings expose complementary failure modes, showing that robustness to one wording change does not imply robustness to the other. Prompt averaging helps when alternate wordings move the model toward a strong reference forecast and can hurt when they move away. Our results establish presentation robustness as a practical evaluation axis for LLM forecasting, alongside accuracy and calibration.
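As a rough illustration of the quantities the abstract refers to, the sketch below computes, for one binary market, the spread of probabilities across equivalent presentations, whether any presentation crosses the $0.5$ decision boundary relative to the original prompt, and the Brier score of the original versus the prompt-averaged forecast. The function name `presentation_metrics`, its arguments, and the toy probabilities are hypothetical and chosen only for illustration; the paper's actual metrics and aggregation may differ.

```python
from statistics import mean


def presentation_metrics(base, variants, outcome):
    """Summarize how one market's forecast moves across presentations.

    base     -- probability from the original presentation (hypothetical input)
    variants -- probabilities from equivalent rewordings of the same market
    outcome  -- resolved outcome of the binary market, 0 or 1
    """
    probs = [base] + list(variants)

    # Spread: range of probabilities over all equivalent presentations.
    spread = max(probs) - min(probs)

    # Side flip: some rewording lands on the other side of 0.5 from the base.
    # A probability of exactly 0.5 is treated as the "no" side here (assumption).
    side_flip = any((p > 0.5) != (base > 0.5) for p in variants)

    # Prompt averaging: simple unweighted mean over all presentations.
    averaged = mean(probs)

    return {
        "spread": spread,
        "side_flip": side_flip,
        "base_brier": (base - outcome) ** 2,
        "averaged_brier": (averaged - outcome) ** 2,
    }


# Toy numbers, not taken from the paper.
example = presentation_metrics(base=0.62, variants=[0.58, 0.44, 0.66], outcome=1)
```

In this toy case one rewording falls below $0.5$, so the market counts as a side flip, and averaging pulls the forecast away from the resolved outcome, which matches the abstract's point that averaging can hurt when alternate wordings move the model in the wrong direction.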
Submission Number: 154