Precursors, Proxies, and Predictive Models for Long-Horizon Tasks

Published: 24 Sept 2025, Last Modified: 24 Sept 2025
Venue: NeurIPS 2025 LLM Evaluation Workshop (Poster)
License: CC BY 4.0
Keywords: LLM, agents, evaluation, evals, long-horizon
Abstract: AI agents show remarkable success at various short tasks and are rapidly improving at longer-horizon tasks, creating a need to evaluate AI capabilities on dangerous tasks that require high autonomy. Evaluations (evals) comprising long-running ``real-world'' tasks may be the best proxies for predicting general performance, but they are expensive to create, run, and compare to human baselines. Furthermore, these tasks often rely on a large, interwoven set of agent skills, which makes predicting capabilities development difficult. We hypothesize that precursor capabilities including ``persistence'', ``dexterity'', and ``adaptability'' are upstream of robust autonomous performance on long-horizon tasks, and design simple procedurally-generated ``proxy'' evals to target these precursors. We then use agent performance on our proxy evals to calibrate a preliminary method of capability prediction on a more complex task: SWE-bench. Our preliminary results show that performance on certain proxy evals can be unusually predictive of performance on other evals. We find that a simple adaptability proxy based on developmental psychology correlates with SWE-bench at $r = 0.95$, and three other proxies correlate with SWE-bench at $r > 0.8$. A proxy eval that takes only ${\sim}$10 steps is strongly correlated with performance on many other evals that take much longer to terminate (${\sim}$100s of steps). Our predictive model's initial results correctly predict agent scores on SWE-bench but have large error bars, suggesting that, by testing more models on more synthetic evals, we could quickly and cheaply predict performance on important long-horizon tasks.
Submission Number: 122
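A minimal sketch of the kind of calibration the abstract describes: given per-model scores on a short proxy eval and on SWE-bench, report the Pearson correlation and fit a simple linear predictor. The function names and input arrays are illustrative assumptions, not the authors' code or data; the abstract does not specify the actual predictive model.

```python
# Minimal illustrative sketch (not the paper's implementation): correlate
# proxy-eval scores with SWE-bench scores and fit a linear predictor.
import numpy as np
from scipy import stats


def calibrate_proxy(proxy_scores: np.ndarray, swebench_scores: np.ndarray) -> dict:
    """Fit swebench_score ~ slope * proxy_score + intercept and report r.

    Both inputs are 1-D arrays with one entry per evaluated model/agent.
    """
    result = stats.linregress(proxy_scores, swebench_scores)
    return {
        "pearson_r": result.rvalue,      # correlation between proxy and SWE-bench scores
        "slope": result.slope,
        "intercept": result.intercept,
        "slope_stderr": result.stderr,   # a large stderr corresponds to wide error bars
    }


def predict_swebench(fit: dict, proxy_score: float) -> float:
    """Predict a new model's SWE-bench score from its proxy-eval score."""
    return fit["slope"] * proxy_score + fit["intercept"]
```

With only a handful of evaluated models, the fitted slope's standard error will be large, which is consistent with the wide error bars the abstract reports; adding more models and more synthetic proxy evals would tighten the prediction.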