Track: Track 1: Original Research/Position/Education/Attention Track
Keywords: benchmarking, evaluation, agentic AI, scientific agents
TL;DR: We propose a new framework for designing benchmarks that measure the capability of AI scientists to accurately perform scientific discovery tasks; by reducing construct validity issues, the framework yields more realistic and representative evaluations.
Abstract: Science is a system defined in part by measurability. Claims made under its banner are trusted under the implicit understanding that they can be verified through measurement. Trustworthy science is therefore only possible when accurate and verifiable measurements of all aspects of a discovery or observation are possible. Recently, a new interloper has emerged in the form of AI scientists. Driven by companies such as Sakana AI and Google, these hybrid human-AI systems tasked with scientific discovery strive to augment and accelerate the current research paradigm by intelligently innovating upon and combining preexisting ideas. As researchers attempt to build collaborative workflows with AI scientists, the need for better measurements of their capabilities and limitations escalates. In this paper, we argue that the complexity of scientific research poses a significant challenge to AI scientist benchmarking efforts because of construct validity issues. Scientific research tasks must be parseable by AI scientists; otherwise, these in silico collaborators pose a significant epistemic risk to the trustworthiness of scientific research. To address this, we propose a new framework for designing benchmarks for AI scientists based on Arthur Koestler's concept of holons. Instead of benchmarking high-level, human-interpretable tasks, we break them down and build specialized benchmarks at the LLM-executable level. The semantic sum of an AI scientist's performance on these benchmarks then approximates performance on the original task. Our framework outlines key criteria for future benchmarks to avoid construct validity issues. We also exemplify the potential of our framework by prototyping a benchmark for attributional accuracy, ultimately aimed at evaluating AI scientists on their ability to generate literature reviews.
Submission Number: 427