Foresight-Phys: A Benchmark for Forecasting the Results of Physical Experiments

Published: 11 Jun 2026, Last Modified: 11 Jun 2026, Venue: Forecast@ICML26 Poster, License: CC BY 4.0
Keywords: Forecasting, Physics, Large Language Models, Benchmarking, AI for Science, Foundation Models, Physical Intuition, Scientific Reasoning
TL;DR: A benchmark that evaluates the ability of AI to forecast outcomes of physical experiments
Abstract: Forecasting has emerged as a hallmark capability of general-purpose AI, and one where the bar is fundamentally super-human: the accuracy ceiling far exceeds what the best experts achieve. Scientific research is itself a forecasting problem: finite resources mandate an exploration/exploitation tradeoff, which in turn relies on predicting the results of studies before conducting them. We introduce **Foresight-Phys**, a benchmark in which AI is asked to predict the results of physical experiments from the newest arXiv preprints. In this paper we present the framework along with results for 135 experiments and 661 typed result fields extracted from 26 physics preprints first released in 2026, after the GPT-5.x training cutoff. The extraction process is automated and is suitable for creating an online benchmark by regularly pulling fresh papers. The strongest model we evaluate, GPT-5.5, attains a mean prediction quality of $87.2\%$ on the benchmark. Prediction quality is a soft, per-field score derived from proper scoring rules and designed to reward calibrated forecasts rather than point guesses alone. On numeric quantities the model attains a relative Continuous Ranked Probability Score (CRPS normalised by the magnitude of the ground truth for normal targets, and measured in dex for log-normal targets) of $0.41$ and places just over half ($52.4\%$) of its predictions within one standard deviation of the ground truth. While this indicates nontrivial physical intuition for the relevant scales, it also reveals substantial overconfidence compared to an ideal Gaussian observer ($68.3\%$). Code and data are available at https://anonymous.4open.science/r/Foresight-Phys-7109/.
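The numeric metrics named in the abstract can be illustrated with a minimal sketch. It assumes each forecast is a Gaussian with mean `mu` and standard deviation `sigma` (in log10 units for log-normal targets) and uses the standard closed-form CRPS of a normal distribution. The function names and the exact normalisation are illustrative assumptions, not the benchmark's actual implementation.

```python
import math

def gaussian_crps(mu, sigma, y):
    """Closed-form CRPS of a N(mu, sigma^2) forecast against the outcome y."""
    z = (y - mu) / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return sigma * (z * (2.0 * cdf - 1.0) + 2.0 * pdf - 1.0 / math.sqrt(math.pi))

def relative_crps(mu, sigma, y, log_normal=False):
    """Hypothetical helper mirroring the abstract's description.

    Normal targets: CRPS divided by the magnitude of the ground truth.
    Log-normal targets: CRPS computed in log10 space, so the result is in dex
    (mu and sigma are then assumed to be given in log10 units, y as a raw
    positive value).
    """
    if log_normal:
        return gaussian_crps(mu, sigma, math.log10(y))
    return gaussian_crps(mu, sigma, y) / abs(y)

def one_sigma_coverage(forecasts):
    """Fraction of (mu, sigma, y) triples whose outcome lies within one sigma."""
    hits = sum(1 for mu, sigma, y in forecasts if abs(y - mu) <= sigma)
    return hits / len(forecasts)

# Coverage expected from a perfectly calibrated Gaussian forecaster (about 68.3%).
IDEAL_ONE_SIGMA = math.erf(1.0 / math.sqrt(2.0))
```

Under these assumptions, a one-sigma coverage below `IDEAL_ONE_SIGMA` means the stated uncertainties are too narrow, which is the overconfidence the abstract reports.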
Submission Number: 185