When Does Evidence Help Prompted LLM Forecasting? Evidence Access and Prompt Structure Across 12 Models

Akram Naoufel Tabet, Mitja Luštrek

Published: 11 Jun 2026, Last Modified: 11 Jun 2026. Venue: Forecast@ICML26 Poster. License: CC BY 4.0

Keywords: LLM forecasting, probabilistic forecasting, evidence access, prompt engineering, calibration, prediction markets, ForecastBench

TL;DR: Shared evidence improves LLM forecasting across 12 models, but no model beats the prediction-market baseline; Bayesian prompting gains most from evidence, though the prompt × evidence interaction remains only suggestive.

Abstract: We present a controlled study of how evidence access and prompt structure shape LLM-based forecasting. Using 114 resolved binary questions from ForecastBench, we evaluate 12 models across three prompting strategies: a direct control prompt, a base-rate prompt that asks models to anchor on historical frequency, and a Bayesian prompt that asks models to state a prior and update on evidence. We compare two information conditions: closed-book forecasting, where models receive no external evidence, and shared evidence retrieved via AskNews before the timestamp at which the market baseline is recorded. Across 8,208 forecasts, shared evidence improves Brier scores across models and prompts ($\Delta=-0.028$, 95\% CI $[-0.036,-0.019]$, $p<0.001$). Bayesian-style prompting performs worse than the control prompt in closed-book settings, consistent with recent evidence that structured reasoning prompts can degrade LLM forecasts when external information is unavailable. Although Bayesian prompting shows the largest numerical improvement from evidence, its advantage over the control prompt is only suggestive at pilot scale and does not reach conventional significance (DiD $=-0.011$, $p=0.08$). In an exploratory extension, a Superforecaster-style prompt performs strongly; because it was not part of the main confirmatory comparison, we report it separately. Despite these gains, no LLM beats the freeze-time prediction-market baseline. These findings suggest that evidence access improves LLM forecasting, but that prompt structure alone is insufficient: even evidence-grounded models remain behind market baselines and suffer from overconfident extreme probabilities.
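The quantities reported in the abstract can be illustrated with a minimal sketch. It assumes the standard Brier score for binary questions (squared error between the forecast probability and the 0/1 outcome) and reads the reported DiD as the evidence gain of one prompt minus the evidence gain of the control prompt. The function names, the `cells` layout, and the condition labels are hypothetical, and the sketch does not reproduce whatever procedure the authors used to obtain confidence intervals or p-values.

```python
from statistics import mean

# Full factorial design from the abstract:
# 114 questions x 12 models x 3 prompts x 2 information conditions.
N_FORECASTS = 114 * 12 * 3 * 2


def brier(prob, outcome):
    """Brier score for one binary forecast: squared error, lower is better."""
    return (prob - outcome) ** 2


def mean_brier(forecasts):
    """Average Brier score over a list of (probability, outcome) pairs."""
    return mean(brier(p, y) for p, y in forecasts)


def evidence_gain(cells, prompt):
    """Change in mean Brier score when evidence is added (negative = improvement).

    `cells` is a hypothetical mapping from (prompt, condition) to a list of
    (probability, outcome) pairs, with condition in {"closed", "evidence"}.
    """
    return mean_brier(cells[(prompt, "evidence")]) - mean_brier(cells[(prompt, "closed")])


def evidence_did(cells, prompt, control="control"):
    """Difference-in-differences: evidence gain of `prompt` minus that of `control`."""
    return evidence_gain(cells, prompt) - evidence_gain(cells, control)
```

Under this reading, a negative difference-in-differences means the prompt benefits more from evidence than the control prompt does, which matches the sign of the value reported in the abstract.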

Submission Number: 78
