CFDLLMBench: A Benchmark Suite for Evaluating Large Language Models in Computational Fluid Dynamics

Published: 27 Feb 2026, Last Modified: 27 Feb 2026
Accepted by: DMLR Special Track
License: CC BY-SA 4.0
Abstract: Large Language Models (LLMs) have demonstrated strong performance across general NLP tasks, but their utility in automating numerical experiments for complex physical systems, a critical and labor-intensive component of scientific research, remains underexplored. As a major workhorse of computational science over the past decades, Computational Fluid Dynamics (CFD) offers a uniquely challenging testbed for evaluating the scientific capabilities of LLMs. We introduce CFDLLMBench, a benchmark suite comprising three complementary components (CFDQuery, CFDCodeBench, and FoamBench) designed to holistically evaluate LLM performance across three key competencies: graduate-level CFD knowledge, numerical and physical reasoning in CFD, and context-dependent implementation of CFD workflows. Grounded in real-world CFD practice, our benchmark combines a detailed task taxonomy with a rigorous evaluation framework to deliver reproducible results and quantify LLM performance across code executability, solution accuracy, and numerical convergence behavior. CFDLLMBench establishes a solid foundation for developing and evaluating LLM-driven automation of numerical experiments for complex physical systems. Code and data are available at https://github.com/NLR-Theseus/cfdllmbench/.
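The abstract names three evaluation axes: whether generated code executes, how accurate its solution is, and how the error behaves under grid refinement. The sketch below is a minimal, hypothetical Python illustration of such metrics, not the benchmark's actual harness; the function names, the choice of relative L2 error, and the two-grid order-of-accuracy estimate are all assumptions for illustration.

```python
import subprocess
import numpy as np

def runs_successfully(script_path: str, timeout_s: int = 300) -> bool:
    # Executability: does the generated solver script run to completion?
    # (Hypothetical check; the benchmark's real harness may differ.)
    try:
        result = subprocess.run(
            ["python", script_path], capture_output=True, timeout=timeout_s
        )
        return result.returncode == 0
    except subprocess.TimeoutExpired:
        return False

def relative_l2_error(u: np.ndarray, u_ref: np.ndarray) -> float:
    # Solution accuracy: relative L2 error against a reference solution.
    return float(np.linalg.norm(u - u_ref) / np.linalg.norm(u_ref))

def observed_order(err_coarse: float, err_fine: float, r: float = 2.0) -> float:
    # Convergence behavior: observed order of accuracy from errors on two
    # grids refined by factor r, i.e. p = log(e_coarse / e_fine) / log(r).
    return float(np.log(err_coarse / err_fine) / np.log(r))

# A second-order scheme should give observed_order close to 2 when the
# error drops by roughly 4x under 2x grid refinement.
print(observed_order(1e-2, 2.5e-3))  # -> 2.0
```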
Video: https://drive.google.com/file/d/11lp_rNvrh56e8X-QgIW8rlJYJgjeLlSy/view?usp=sharing
Code: https://github.com/NLR-Theseus/cfdllmbench/
Submission Number: 5