TL;DR: Gravity-Bench evaluates scientific discovery capabilities of AI agents through gravitational dynamics simulations, challenging them to strategically collect observations and employ scientific reasoning, including in out-of-distribution scenarios.
Abstract: Modern science emerged from reasoning over repeatedly observed planetary motions. We present Gravity-Bench-v1, an environment-based benchmark that challenges AI agents on tasks paralleling this historical development. Gravity-Bench-v1 evaluates agents on the discovery of physics concealed within a dynamic environment, using rigorous gravitational dynamics simulations. It includes out-of-distribution cases, i.e., scenarios whose physics deviates from the real world, to evaluate true scientific generalization capabilities. Agents must plan data collection within an experimental budget and must perform a dynamic form of data analysis and reasoning to solve tasks efficiently. Our benchmark admits an open-ended space of solutions, and reference solutions for each task are provided to calibrate AI performance against human expertise. Although technically at an upper-undergraduate level, our benchmark proves challenging for baseline AI agents. Gravity-Bench-v1 and planned extensions should help map out AI progress towards scientific discovery capabilities.
Lay Summary: For AI to support scientific discoveries, it must not only analyze data but also actively explore and figure things out, just as human scientists do. To test and measure these skills, we created Gravity-Bench, an interactive benchmark inspired by the historical scientific breakthroughs involving gravity and planetary motion.
In Gravity-Bench, AI agents act as astronomers exploring binary star systems, in which two stars orbit each other. Instead of just analyzing pre-collected data, the agent has to plan and gather its own observations strategically, within a limited budget. The tasks go beyond textbook examples: sometimes the simulated environment introduces new laws of physics that differ from the ones we know, compelling agents to adapt in order to solve novel problems.
Our tests show that current AI models, such as those that power ChatGPT, still struggle with these challenges, particularly with planning observations and drawing correct conclusions. Gravity-Bench provides researchers with a scientifically meaningful framework to track progress towards AI capable of original scientific discovery.
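To make the interaction pattern concrete, here is a minimal, hypothetical Python sketch of an observe-within-a-budget loop on a circular binary. The names (`BinaryEnv`, `observe`) and the simple fixed-cadence agent are illustrative assumptions, not the benchmark's actual interface; see the repository linked below for the real implementation.

```python
import numpy as np

G = 6.674e-11  # gravitational constant [m^3 kg^-1 s^-2]

class BinaryEnv:
    """Hypothetical environment: a circular binary observed on demand.

    Each call to observe(t) costs one unit of a fixed observation
    budget, mirroring the budgeted-observation setup described above.
    """

    def __init__(self, m1=2.0e30, m2=1.0e30, a=1.5e11, budget=40):
        self.m1, self.m2, self.a, self.budget = m1, m2, a, budget
        self.omega = np.sqrt(G * (m1 + m2) / a**3)  # circular angular rate

    def observe(self, t):
        if self.budget <= 0:
            raise RuntimeError("observation budget exhausted")
        self.budget -= 1
        M = self.m1 + self.m2
        phase = self.omega * t
        # Positions of each star about the centre of mass.
        r1 = (self.m2 / M) * self.a * np.array([np.cos(phase), np.sin(phase)])
        r2 = -(self.m1 / M) * self.a * np.array([np.cos(phase), np.sin(phase)])
        return r1, r2

# A naive "agent": sample uniformly in time, then infer the total mass
# from Kepler's third law, M = omega^2 * a^3 / G.
env = BinaryEnv()
times = np.linspace(0.0, 2.0e7, 30)           # strategy: 30 evenly spaced epochs
obs = [env.observe(t) for t in times]

sep = np.array([r1 - r2 for r1, r2 in obs])   # separation vectors
a_est = np.mean(np.linalg.norm(sep, axis=1))  # circular orbit -> constant |sep|
angles = np.unwrap(np.arctan2(sep[:, 1], sep[:, 0]))
omega_est = np.polyfit(times, angles, 1)[0]   # slope of phase vs. time

M_est = omega_est**2 * a_est**3 / G
print(f"inferred total mass: {M_est:.3e} kg (true: 3.000e+30 kg)")
```

In the actual benchmark, fixed uniform sampling like this generally will not suffice: eccentric orbits and out-of-distribution physics reward agents that adapt their observing strategy to what they have seen so far.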
Link To Code: https://github.com/NolanKoblischke/GravityBench
Primary Area: Deep Learning->Large Language Models
Keywords: Scientific Discovery, Benchmarking, Evaluations, Physics, Agents, Planning, Large Language Models
Submission Number: 4029