ModelBench: A Benchmark for Extracting Executable, Physics-Based Models from Scientific Literature

17 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: Scientific AI benchmarks, Physics, LLM-as-judge, Rubric-based evaluation
TL;DR: ModelBench is a benchmark for testing whether AI systems can read physics papers and produce executable, physics-based models.
Abstract: We introduce **ModelBench**, a benchmark for evaluating whether AI systems can extract executable physics-based models from scientific literature. ModelBench couples (i) gold-standard reference models, (ii) a hierarchical, weighted binary rubric covering physics correctness, code quality, and reproduction quality, and (iii) a judge protocol that produces pass/fail scores at rubric leaves. Unlike code-generation benchmarks that test function-level correctness, ModelBench targets the end-to-end task of reconstructing physically grounded models from incomplete and underspecified scientific descriptions. We release the benchmark specification, the rubric generator and judge prompts, and an initial set of 20 gold models in the field of photonic integrated circuits, alongside scripts for fully reproducible evaluation. Candidate systems are required to produce a Python implementation of the model, a plot of the fitted results, and the MSE and $R^2$ metrics of the fit. Using general-purpose LLMs as neutral baselines, we report aggregate scores and case studies that reveal common failure modes (e.g., constraint violations, phenomenological overfitting) and show how the rubric structure aids diagnostic evaluation. We discuss limitations (judge variance, dataset breadth, implicit-knowledge gaps) and outline a roadmap to expand domains, tighten constraint checking, and support multiple valid solutions. ModelBench provides a transparent platform for tracking scientific modeling capabilities of AI under physical and empirical constraints.
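As a minimal sketch of how a hierarchical, weighted binary rubric can aggregate leaf-level pass/fail judgments into a single score: each leaf contributes 0 or 1, and every internal node takes the weight-normalized average of its children. All names, the tree layout, and the weight scheme below are illustrative assumptions, not the released ModelBench code.

```python
def rubric_score(node):
    """Score a rubric node in [0, 1].

    A leaf is {"weight": w, "passed": bool}; an internal node is
    {"weight": w, "children": [...]}. Internal nodes average their
    children's scores, normalized by the children's total weight.
    """
    if "children" not in node:
        return 1.0 if node["passed"] else 0.0
    total = sum(c["weight"] for c in node["children"])
    return sum(c["weight"] * rubric_score(c) for c in node["children"]) / total


# Hypothetical three-branch rubric mirroring the categories in the abstract.
rubric = {
    "weight": 1.0,
    "children": [
        {"weight": 0.5, "children": [          # physics correctness
            {"weight": 1.0, "passed": True},   # e.g. conserves energy
            {"weight": 1.0, "passed": False},  # e.g. violates a constraint
        ]},
        {"weight": 0.3, "passed": True},       # code quality (leaf here)
        {"weight": 0.2, "passed": True},       # reproduction quality
    ],
}
print(rubric_score(rubric))  # 0.75 = 0.5*0.5 + 0.3*1.0 + 0.2*1.0
```

Because the aggregation is recursive, a sub-score can be read off at any internal node, which is how a rubric of this shape supports the diagnostic, per-category evaluation the abstract describes.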
Primary Area: datasets and benchmarks
Submission Number: 9367