Keywords: Forecasting, LLMs
Abstract: Top AI forecasting systems perform similarly to skilled humans by combining frontier LLMs with forecasting-specific context gathering and scaffolding. We study how to improve this recipe through ensembling: given a fixed number of samples, which off-the-shelf model forecasts should be combined to maximize forecasting performance? On binary questions from the Metaculus AI Benchmark, we find that strong standalone performance is not enough for strong ensembling: additional forecasts add little when they come from frontier LLMs with highly correlated predictions. Instead, the strongest ensembles combine accurate but diverse forecasters; among the off-the-shelf frontier models we study, Grok 4 is especially valuable because its predictions are less correlated with those of Gemini 3 Pro and GPT-5. These results suggest that the strength of the AI crowd comes not from sampling more forecasts indiscriminately, but from combining forecasts across models with complementary errors. This motivates forecasting systems that explicitly optimize for both model quality and diversity.
Submission Number: 69