When Does Evidence Help Prompted LLM Forecasting? Evidence Access and Prompt Structure Across 12 Models
Keywords: LLM forecasting, probabilistic forecasting, evidence access, prompt engineering, calibration, prediction markets, ForecastBench
TL;DR: Shared evidence improves LLM forecasting across 12 models, but no model beats the prediction-market baseline; Bayesian prompting gains most from evidence, though the prompt × evidence interaction remains only suggestive.
Abstract: We present a controlled study of how evidence access and prompt
structure shape LLM-based forecasting. Using 114 resolved
binary questions from ForecastBench, we evaluate 12 models across three
prompting strategies: a direct control prompt, a base-rate prompt that
asks models to anchor on historical frequency, and a Bayesian prompt
that asks models to state a prior and update on evidence. We compare
two information conditions: closed-book forecasting, where models
receive no external evidence, and shared evidence retrieved via AskNews
before the timestamp at which the market baseline is recorded. Over
8,208 forecasts (114 questions × 12 models × 3 prompts × 2 evidence
conditions), shared evidence lowers (improves) Brier scores
across models and prompts ($\Delta=-0.028$, 95\% CI
$[-0.036,-0.019]$, $p<0.001$). Bayesian-style prompting performs worse
than the control prompt in closed-book settings, consistent with recent
evidence that structured reasoning prompts can degrade LLM forecasts
when external information is unavailable. Although Bayesian prompting
shows the largest numerical improvement from evidence, its advantage
over the control prompt is only suggestive at pilot scale and does not
reach conventional significance (DiD $=-0.011$, $p=0.08$). In an
exploratory extension, a Superforecaster-style prompt performs
strongly; because it was not part of the main confirmatory comparison,
we report it separately. Despite these gains, no LLM beats the
freeze-time prediction-market baseline. These findings suggest that
evidence access improves LLM forecasting, but that prompt structure
alone is insufficient: even evidence-grounded models remain behind
market baselines and suffer from overconfident extreme probabilities.
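For readers less familiar with the quantities reported above, the following is a minimal Python sketch of how a Brier score and a difference-in-differences (DiD) contrast between prompts could be computed. It is not the authors' analysis code: the function names, the cell-mean form of the DiD, and the toy numbers are all assumptions made for illustration, and the reported estimates may come from a different model (for example, a regression with clustered errors).

```python
from statistics import mean

def brier(prob, outcome):
    """Brier score for one binary forecast: squared error, lower is better."""
    return (prob - outcome) ** 2

def mean_brier(forecasts):
    """Average Brier score over (probability, outcome) pairs."""
    return mean(brier(p, y) for p, y in forecasts)

def evidence_did(cell_means, prompt="bayesian", baseline="control"):
    """Assumed DiD: (evidence - closed) for `prompt` minus the same gap for `baseline`.

    `cell_means` maps (prompt_name, condition) -> mean Brier score.
    A negative value means `prompt` gains more from evidence than `baseline`.
    """
    gain_prompt = cell_means[(prompt, "evidence")] - cell_means[(prompt, "closed")]
    gain_base = cell_means[(baseline, "evidence")] - cell_means[(baseline, "closed")]
    return gain_prompt - gain_base

# Design size implied by the abstract: questions x models x prompts x conditions.
n_forecasts = 114 * 12 * 3 * 2

# Toy cell means, invented for illustration only (not results from the study).
toy_cells = {
    ("control", "closed"): 0.22,
    ("control", "evidence"): 0.19,
    ("bayesian", "closed"): 0.25,
    ("bayesian", "evidence"): 0.20,
}
toy_did = evidence_did(toy_cells)
```

In this toy setup the Bayesian prompt starts worse in the closed-book condition but improves more once evidence is added, so `toy_did` is negative, which is the same sign convention as the DiD reported in the abstract.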
Submission Number: 78