FEABench: Evaluating Language Models on Real World Physics Reasoning Ability

27 Sept 2024 (modified: 05 Feb 2025) · Submitted to ICLR 2025 · CC BY 4.0
Keywords: numerical analysis, finite element, benchmark, agents
TL;DR: How well can LLMs leverage FEA software to simulate and solve problems that require numerical analysis?
Abstract: Building precise simulations of the real world and invoking numerical solvers to answer quantitative problems is an essential requirement in engineering and science. We present FEABench, a benchmark to evaluate the ability of large language models (LLMs) and LLM agents to simulate and solve physics, mathematics, and engineering problems using finite element analysis (FEA). We introduce a multipronged evaluation scheme to investigate the ability of LLMs to solve these problems by reasoning over natural language problem descriptions and operating COMSOL Multiphysics®, an FEA software, to compute the answers. In addition to testing state-of-the-art LLMs, we design a language model agent equipped with the ability to interact with the software through its Application Programming Interface (API), examine its outputs, and use tools to improve its solutions over multiple iterations. Our best performing strategy generates executable API calls 88% of the time. However, the benchmark remains challenging: none of the LLMs and agents we tested was able to completely and correctly solve any problem. LLMs that can successfully interact with and operate FEA software to solve problems such as those in our benchmark would significantly push the frontiers of their utility. Acquiring this capability would augment LLMs' reasoning skills with the precision of numerical solvers and advance the development of autonomous systems that can tackle complex problems in the real world.
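The generate-execute-refine loop described in the abstract can be sketched in a few lines. The following is a minimal illustration under stated assumptions, not the paper's implementation: `llm` and `execute` are hypothetical stand-ins for the model call and the bridge to the FEA software's API, and the prompt wording is invented.

```python
from typing import Callable, Optional, Tuple

# Hypothetical stand-ins (not the paper's interfaces): an LLM call mapping a
# prompt to text, and a bridge that executes candidate API calls in the FEA
# software and returns (executed_ok, solver_feedback).
LLMFn = Callable[[str], str]
ExecFn = Callable[[str], Tuple[bool, str]]

def solve_with_agent(problem: str, llm: LLMFn, execute: ExecFn,
                     max_iters: int = 5) -> Optional[str]:
    """Propose API calls, execute them, and refine from solver feedback."""
    prompt = f"Write FEA API calls that solve this problem:\n{problem}"
    for _ in range(max_iters):
        candidate = llm(prompt)            # propose a solution attempt
        ok, feedback = execute(candidate)  # run it against the solver
        if ok:
            return candidate               # executable API calls found
        # Fold the solver's error messages back into the next prompt.
        prompt = (f"{prompt}\n\nPrevious attempt:\n{candidate}\n"
                  f"Solver feedback:\n{feedback}\nRevise the API calls.")
    return None  # iteration budget exhausted without an executable solution
```

Note that a loop like this can only verify executability, not correctness, which is why the paper pairs it with a multipronged evaluation scheme.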
Primary Area: datasets and benchmarks
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 12171