Elicitation Format Drives Divergent LLM Geopolitical Forecasts

Published: 11 Jun 2026, Last Modified: 25 Jun 2026, Venue: Forecast@ICML26 Poster, License: CC BY 4.0
Keywords: forecasting, biases, evaluation, geopolitical
Abstract: Large language models are approaching expert-level performance on geopolitical forecasting tasks, but a broad literature on LLM behavior shows that model outputs can shift under minor prompt perturbations. Whether matched geopolitical forecasts are similarly unstable under benign changes in elicitation remains underexplored. We study that question in a closed-book setting using Claude, GPT-OSS, and Qwen models and matched country-index forecasting tasks that hold the country, index, and horizon fixed while varying question form. A closed-book ForecastBench control confirms that the models are competent forecasters. Yet on governance targets, binary questions produce much larger US-sphere versus China-sphere gaps than matched numerical forecasts of the same country-index pairs. A Human Freedom Index comparison shows a smaller cross-bloc gap on matched economic sub-indices, suggesting that the binary amplification is concentrated in politically evaluative concepts rather than country forecasting in general. Trilingual reruns reveal additional but less uniform instability, and mirrored improve/decline prompts do not support a simple yes-saying explanation. We therefore argue that evaluations of LLM geopolitical forecasting should report robustness to elicitation alongside resolved-event accuracy, especially for politically evaluative targets.
Submission Number: 67