Track: Track 1: Original Research/Position/Education/Attention Track
Keywords: agents, scientific agents, benchmarking, ai4chemistry, ai4materials, llms
TL;DR: We introduce Corral, a framework for evaluating scientific LLM agents.
Abstract: Large language models (LLMs) equipped with external tools through agentic frameworks promise to overcome domain-specific limitations by providing specialized capabilities for scientific applications. However, the extent to which these systems genuinely enhance performance in complex scientific domains remains poorly understood. Here we present Corral, a modular benchmarking framework that systematically evaluates LLM-based agents across four expert-designed environments spanning molecular dynamics, machine learning, catalysis, and spectroscopy in chemistry and materials science. Through comprehensive evaluation of state-of-the-art models using different agent scaffolds, we demonstrate that the choice of agentic framework, whether ReAct or tool-calling, plays a surprisingly minor role in determining success. Instead, performance depends critically on the semantic alignment between available tools and task requirements, measured through embedding similarity. When this alignment is poor, even sophisticated reasoning frameworks cannot compensate for inadequate tool provisioning and a lack of domain knowledge. Conversely, when base models possess sufficient domain knowledge, agentic frameworks can introduce unnecessary overhead without meaningful benefits. Our findings challenge the assumption that agentic systems provide a universal solution to model limitations, revealing instead that, currently, successful scientific agents might require the same level of domain expertise in tool design that agentic systems were promised to circumvent.
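The abstract attributes agent success to the semantic alignment between available tools and task requirements, measured through embedding similarity. The sketch below is only an illustrative reading of such a measure, not the paper's implementation: it assumes a sentence-transformers embedding model and cosine similarity between a task description and hypothetical tool descriptions.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

def tool_task_alignment(task_description: str, tool_descriptions: list[str],
                        model_name: str = "all-MiniLM-L6-v2") -> np.ndarray:
    """Embed the task and each tool description, then return the cosine
    similarity between the task and every tool (illustrative only)."""
    model = SentenceTransformer(model_name)
    # Normalized embeddings make the dot product equal to cosine similarity.
    task_vec = model.encode([task_description], normalize_embeddings=True)
    tool_vecs = model.encode(tool_descriptions, normalize_embeddings=True)
    return (tool_vecs @ task_vec.T).ravel()

if __name__ == "__main__":
    # Hypothetical task and tool descriptions; names are not from the paper.
    task = ("Run an NVT molecular dynamics simulation of a water box "
            "and report the average temperature.")
    tools = [
        "run_md: launch a molecular dynamics simulation with a chosen thermostat",
        "plot_spectrum: render an IR spectrum from vibrational frequencies",
        "query_materials_db: look up bulk properties of inorganic crystals",
    ]
    for desc, score in zip(tools, tool_task_alignment(task, tools)):
        print(f"{score:+.3f}  {desc.split(':')[0]}")
```

Under this sketch, a low maximum similarity across the tool set would flag poor tool provisioning for the task, in the spirit of the alignment argument made in the abstract.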
Submission Number: 261