TL;DR: Gravity-Bench evaluates scientific discovery capabilities of AI agents through gravitational dynamics simulations, challenging them to strategically collect observations and employ scientific reasoning, including in out-of-distribution scenarios.
Abstract: Modern science emerged from reasoning over repeatedly observed planetary motions. We present Gravity-Bench-v1, an environment-based benchmark that challenges AI agents on tasks paralleling this historical development. Gravity-Bench-v1 evaluates agents on the discovery of physics concealed within a dynamic environment, using rigorous gravitational dynamics simulations. It includes out-of-distribution cases, i.e., scenarios whose physics deviates from the real world, to evaluate true scientific generalization capabilities. Agents must plan data collection within an experimental budget and must perform a dynamic form of data analysis and reasoning to solve tasks efficiently. Our benchmark admits an open-ended space of solutions, and reference solutions for each task are provided to calibrate AI performance against human expertise. Although technically at an upper-undergraduate level, our benchmark proves challenging for baseline AI agents. Gravity-Bench-v1 and planned extensions should help map out AI progress towards scientific discovery capabilities.
Lay Summary: For AI to support scientific discoveries, it must not only analyze data but also actively explore and figure things out, just as human scientists do. To test and measure these skills, we created Gravity-Bench, an interactive benchmark inspired by the historical scientific breakthroughs involving gravity and planetary motion.
In Gravity-Bench, AI agents act as astronomers exploring binary star systems, in which two stars orbit each other. Instead of just analyzing pre-collected data, the agent has to plan and gather its own observations strategically, within a limited budget. The tasks go beyond textbook examples: sometimes the simulated environment introduces new laws of physics that differ from the ones we know, compelling agents to adapt in order to solve novel problems.
Our tests show that current AI models, such as those that power ChatGPT, still struggle with these challenges, particularly with planning observations and drawing correct conclusions. Gravity-Bench provides researchers with a scientifically meaningful framework to track progress towards AI capable of original scientific discovery.
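To make the interaction pattern concrete, here is a minimal, hypothetical Python sketch of an observe-within-a-budget loop on a circular binary. The names (`BinaryEnv`, `observe`) and the simple fixed-cadence agent are illustrative assumptions, not the benchmark's actual interface; see the repository linked below for the real implementation.

```python
import numpy as np

G = 6.674e-11  # gravitational constant [m^3 kg^-1 s^-2]

class BinaryEnv:
    """Hypothetical environment: a circular binary observed on demand.

    Each call to observe(t) costs one unit of a fixed observation
    budget, mirroring the budgeted-observation setup described above.
    """

    def __init__(self, m1=2.0e30, m2=1.0e30, a=1.5e11, budget=40):
        self.m1, self.m2, self.a, self.budget = m1, m2, a, budget
        self.omega = np.sqrt(G * (m1 + m2) / a**3)  # circular angular rate

    def observe(self, t):
        if self.budget <= 0:
            raise RuntimeError("observation budget exhausted")
        self.budget -= 1
        M = self.m1 + self.m2
        phase = self.omega * t
        # Positions of each star about the centre of mass.
        r1 = (self.m2 / M) * self.a * np.array([np.cos(phase), np.sin(phase)])
        r2 = -(self.m1 / M) * self.a * np.array([np.cos(phase), np.sin(phase)])
        return r1, r2

# A naive "agent": sample uniformly in time, then infer the total mass
# from Kepler's third law, M = omega^2 * a^3 / G.
env = BinaryEnv()
times = np.linspace(0.0, 2.0e7, 30)           # strategy: 30 evenly spaced epochs
obs = [env.observe(t) for t in times]

sep = np.array([r1 - r2 for r1, r2 in obs])   # separation vectors
a_est = np.mean(np.linalg.norm(sep, axis=1))  # circular orbit -> constant |sep|
angles = np.unwrap(np.arctan2(sep[:, 1], sep[:, 0]))
omega_est = np.polyfit(times, angles, 1)[0]   # slope of phase vs. time

M_est = omega_est**2 * a_est**3 / G
print(f"inferred total mass: {M_est:.3e} kg (true: 3.000e+30 kg)")
```

In the actual benchmark, fixed uniform sampling like this generally will not suffice: eccentric orbits and out-of-distribution physics reward agents that adapt their observing strategy to what they have seen so far.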
Link To Code: https://github.com/NolanKoblischke/GravityBench
Primary Area: Deep Learning->Large Language Models
Keywords: Scientific Discovery, Benchmarking, Evaluations, Physics, Agents, Planning, Large Language Models
Submission Number: 4029