Keywords: discovery, build, application, loop, science, llm, self-evolve, reasoning
TL;DR: A Minecraft benchmark testing the discovery-to-application loop, a decomposition of model capacity gaps, and an agent method
Abstract: The capacity to discover causal regularities and apply them to build functional systems—the discovery-to-application loop—is a hallmark of general intelligence. However, evaluating this capacity has been hindered by the vast complexity gap between scientific discovery and real-world engineering. We introduce SciCrafter, a Minecraft-based benchmark that operationalizes this loop through parameterized redstone-circuit tasks. Agents must ignite lamps in specified patterns (e.g., simultaneously or in timed sequences), where scaling target parameters exponentially increases construction complexity and required knowledge—forcing genuine discovery rather than reliance on memorized solutions. Evaluating frontier models including GPT-5.2, Gemini-3-Pro, and Claude-Opus-4.5, we find that all plateau at approximately 25\% success rate. To diagnose these failures, we decompose the loop into four capacities—knowledge gap identification, experimental discovery, knowledge consolidation, and knowledge application—and design targeted interventions to isolate each gap. Our analysis reveals that knowledge application bottlenecks mid- to low-tier models, while knowledge gap identification becomes the primary limitation for frontier models. We hope these findings shed light on current capacity gaps and inform future research on AI systems that can navigate the discovery-to-application loop.
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 81