Abstract: Foundation models excel at single-turn reasoning, but many real-world challenges, from scientific research to technology development, require multi-turn exploration in dynamic, interactive environments. Crucial components of learning from experience in these settings, such as efficiently gathering information to test hypotheses, meta-learning a model of the world's dynamics, and adapting to unexpected changes, remain largely unexplored for these models. We first evaluate foundation models in Feature World, a setting that primarily tests information gathering about a static hidden reward function. In this initial setting, we show that state-of-the-art foundation models come close to optimal efficiency in selecting maximally informative actions in tasks with simple reward functions. As a proof of concept, we also show that a model can gather information efficiently in a 3D embodied version of this task, though errors in visual perception limit some aspects of performance. To test exploration across multiple dependent turns and trials, we implement a custom, text-based version of the Alchemy environment, a benchmark designed for meta-learning. Here, agents must deduce a latent causal structure by integrating information across multiple state-dependent trials. In this more complex setting, we find that recent foundation models struggle to meta-learn strategies that improve performance over time. However, prompting the models to summarize their observations at regular intervals enables an emergent meta-learning process, allowing them to improve across trials. Notably, in some models, summarization also enabled adaptive re-learning of the latent structure when the environment's rules changed unexpectedly. While most models performed reasonably well on simple Feature World tasks, evaluations in Alchemy reveal stark differences in robustness among the models, with Gemini 2.5 performing best, followed by Claude 3.7, and with ChatGPT-4o and o4-mini struggling the most. These results underscore Alchemy's value as a benchmark for meta-learning and strategy adaptation in foundation models. By moving beyond simple discovery to complex, stateful environments, we demonstrate that the most significant challenge for foundation agents is not selecting informative actions in the moment, but rather seeking and integrating knowledge through adaptive strategies over time. Intriguingly, our results suggest there is no intrinsic barrier to future generations of foundation agents mastering these abilities more fully.
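To make the summarization scaffolding concrete, the sketch below shows one way the abstract's "summarize at regular intervals" protocol could be wired into an agent loop. This is a minimal illustration, not the paper's released code: `query_model`, `run_agent`, the `env.step` return signature, and `SUMMARY_INTERVAL` are all assumed names, and the paper does not specify the exact prompts or interval used.

```python
# Hypothetical sketch of periodic-summarization scaffolding: every
# SUMMARY_INTERVAL steps, the agent compresses its raw observation history
# into a running summary, which is prepended to subsequent action prompts.

SUMMARY_INTERVAL = 10  # assumed interval; the paper does not fix a value here


def query_model(prompt: str) -> str:
    """Placeholder for a single call to a foundation model API."""
    raise NotImplementedError


def run_agent(env, max_steps: int = 100) -> None:
    summary = ""              # running summary, carried across trials
    history: list[str] = []   # raw observations since the last summary

    obs = env.reset()
    for step in range(max_steps):
        # Periodically fold recent observations into the summary -- the
        # scaffolding the abstract credits with enabling improvement
        # across trials in Alchemy.
        if step > 0 and step % SUMMARY_INTERVAL == 0:
            summary = query_model(
                "Update your notes on how this environment works.\n"
                f"Previous summary: {summary}\n"
                "Recent observations:\n" + "\n".join(history)
            )
            history.clear()

        history.append(f"Step {step}: {obs}")
        action = query_model(
            f"What you have learned so far: {summary}\n"
            "Recent observations:\n" + "\n".join(history) +
            "\nChoose your next action."
        )
        obs, reward, done = env.step(action)  # assumed (obs, reward, done) API
        if done:
            obs = env.reset()  # next trial begins; the summary persists
```

Because the summary survives trial boundaries while the raw history is cleared, knowledge accumulates across trials without the prompt growing unboundedly, which is the intuition behind the emergent meta-learning effect described above.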
Submission Length: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission: Updated the Discussion with a paragraph comparing our paper in more detail with previous benchmarks and pointing out specific insights unique to this work:
While recent benchmarks like DISCOVERYWORLD \citep{Jansen2024-gs} and ScienceAgentBench \citep{Chen2024-nu} evaluate agents on realistic, end-to-end scientific workflows, our work complements this macro-level view by isolating and quantifying the specific cognitive mechanisms underlying exploration in controlled, abstract settings. Our findings refine the broad observation in these benchmarks that agents struggle with discovery tasks by identifying precisely where this struggle occurs. In Feature World, we find that foundation models act as near-optimal information gatherers, demonstrating that the immediate logic of selecting informative actions---a core component of experimental design---is already a robust capability, contradicting the impression from broader benchmarks that agents inherently lack exploratory capability. Instead, our Alchemy results suggest that some of the failure modes observed in broader benchmarks may stem from deficits in the ability to integrate observations into a coherent and adaptable world model across trials. Furthermore, we demonstrate that this deficit is not permanent: lightweight scaffolding, such as periodic summarization, can unlock these latent capabilities.
Video: https://youtu.be/WgkJrNq7Vbo
Assigned Action Editor: ~Lihong_Li1
Submission Number: 5560