LLM-SRBench: A New Benchmark for Scientific Equation Discovery with Large Language Models

Published: 01 May 2025 · Last Modified: 18 Jun 2025 · ICML 2025 Oral · CC BY 4.0
TL;DR: We present LLM-SRBench, the first comprehensive benchmark for evaluating scientific equation discovery with LLMs, designed to rigorously assess discovery capabilities beyond memorization
Abstract: Scientific equation discovery is a fundamental task in the history of scientific progress, enabling the derivation of laws governing natural phenomena. Recently, Large Language Models (LLMs) have attracted interest for this task due to their potential to leverage embedded scientific knowledge for hypothesis generation. However, evaluating the true discovery capabilities of these methods remains challenging, as existing benchmarks often rely on common equations that are susceptible to memorization by LLMs, leading to inflated performance metrics that do not reflect actual discovery. In this paper, we introduce LLM-SRBench, a comprehensive benchmark with 239 challenging problems across four scientific domains, specifically designed to evaluate LLM-based scientific equation discovery methods while preventing trivial memorization. Our benchmark comprises two main categories: LSR-Transform, which transforms common physical models into less common mathematical representations to test reasoning beyond memorization, and LSR-Synth, which introduces synthetic, discovery-driven problems requiring data-driven reasoning. Through extensive evaluation of several state-of-the-art methods on LLM-SRBench, using both open and closed LLMs, we find that the best-performing system so far achieves only 31.5% symbolic accuracy. These findings highlight the challenges of scientific equation discovery, positioning LLM-SRBench as a valuable resource for future research.
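The abstract reports results in terms of symbolic accuracy, i.e., whether a discovered equation matches the ground-truth equation as a symbolic expression rather than merely fitting the data numerically. As a rough illustration only (the benchmark's actual metric and implementation live in the linked repository and may differ), a minimal sympy-based equivalence check might look like the following sketch; the function name and variable set here are hypothetical:

```python
# Minimal sketch of a symbolic-equivalence check (NOT the benchmark's
# official metric): a candidate equation counts as correct if it
# simplifies to the same expression as the ground truth.
import sympy as sp

def symbolically_equivalent(expr_a: str, expr_b: str, var_names: str = "x t") -> bool:
    """Return True if two equation strings simplify to the same expression."""
    syms = sp.symbols(var_names)
    local_syms = {s.name: s for s in syms}
    a = sp.sympify(expr_a, locals=local_syms)
    b = sp.sympify(expr_b, locals=local_syms)
    # simplify(a - b) == 0 is a practical (though not complete) test
    # of symbolic equivalence.
    return sp.simplify(a - b) == 0

# Algebraically different forms of the same law should match:
print(symbolically_equivalent("x**2 - t**2", "(x - t)*(x + t)"))  # True
print(symbolically_equivalent("sin(x)**2", "1 - cos(x)**2"))      # True
print(symbolically_equivalent("x*t", "x + t"))                    # False
```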
Lay Summary: Scientists have long sought to discover mathematical equations that explain how the natural world works, from gravity to climate patterns. Recently, researchers have been testing whether large language models (LLMs) can help discover these scientific models by drawing on their vast knowledge. However, current tests of these discovery frameworks are flawed because they are based on benchmarks with well-known equations that LLMs might have simply memorized during training. In this work, we created LLM-SRBench, a challenging new benchmark with 239 difficult problems across four scientific domains, addressing the memorization issue of current benchmarks. Our benchmark includes two types of challenges: problems that disguise familiar physics equations in unfamiliar mathematical forms (LSR-Transform), and completely synthetic problems that require genuine reasoning from data rather than recall of memorized facts (LSR-Synth). When we tested several leading LLM-based discovery frameworks on our benchmark, even the best performer solved only about one-third of the problems correctly. This reveals that current frameworks are far from being able to truly discover scientific equations on their own. Our benchmark provides researchers with a more honest way to measure progress in LLM-assisted scientific discovery, helping guide future breakthroughs in this emerging field.
Link To Code: https://github.com/deep-symbolic-mathematics/llm-srbench
Primary Area: Applications->Chemistry, Physics, and Earth Sciences
Keywords: Benchmark, Scientific Discovery, Large Language Models, Symbolic Regression
Submission Number: 14812