Keywords: Dynamic Evaluation, Structured Reasoning, Bias Diagnosis, Synthetic Data
TL;DR: We present AutoBench, a dynamic, synthetic physics benchmark that enables controlled, experimental auditing of LLM reasoning, robustness, and failure modes at scale.
Abstract: Rigorous auditing of large language models (LLMs) demands dynamic, controllable evaluations, especially in scientific domains like physics where precise reasoning is essential. We introduce AutoBench, a novel synthetic benchmark for experimental auditing of LLM reasoning, robustness, and failure modes. AutoBench comprises 15K university-level physics problems generated through a rigorous process, each paired with structured step-by-step reasoning, symbolic Python code, and a final computed answer. The dataset is fully dynamic: each problem is parameterized, allowing controlled variation of inputs with automatic solution updates via the associated code. Additionally, multiple paraphrased variants of each problem enable systematic perturbations of linguistic structure. These features support fine-grained experiments to test generalization, diagnose biases, and uncover brittle reasoning behaviors, advancing beyond traditional static benchmarks. Because the dataset is fully dynamic, we can also withhold key components of a problem to measure hallucination and model behavior under uncertainty. We further evaluate a range of state-of-the-art instruction-tuned LLMs on AutoBench, revealing new insights into their scientific reasoning capabilities and failure patterns. Our work highlights the importance of dynamic synthetic datasets for principled, experimental auditing of model behavior.
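To make the parameterization concrete, the sketch below shows one way an AutoBench-style item could pair a templated problem statement with symbolic Python code that recomputes the answer whenever the inputs change. This is an illustrative assumption, not the authors' released pipeline; the example problem, parameter names, and paraphrase templates are hypothetical.

```python
# Minimal sketch of a parameterized, dynamically re-solvable physics problem.
# All names and templates here are illustrative assumptions.
import random
import sympy as sp

# Symbolic solution: horizontal range of a projectile launched at speed v0,
# angle theta, under gravity g (ideal, no air resistance).
v0, theta, g = sp.symbols("v0 theta g", positive=True)
range_expr = v0**2 * sp.sin(2 * theta) / g

# Paraphrased statement templates sharing the same underlying parameters.
TEMPLATES = [
    "A ball is launched at {v0} m/s at an angle of {theta} degrees. "
    "Taking g = {g} m/s^2, find the horizontal range.",
    "With g = {g} m/s^2, what horizontal distance does a projectile cover "
    "if it leaves the ground at {v0} m/s, {theta} degrees above the horizontal?",
]

def sample_instance(seed: int) -> dict:
    """Draw parameters, render a paraphrased statement, and recompute the answer."""
    rng = random.Random(seed)
    params = {"v0": rng.randint(5, 50), "theta": rng.randint(10, 80), "g": 9.81}
    statement = rng.choice(TEMPLATES).format(**params)
    answer = float(
        range_expr.subs(
            {v0: params["v0"], theta: sp.rad(params["theta"]), g: params["g"]}
        )
    )
    return {"statement": statement, "params": params, "answer": round(answer, 2)}

if __name__ == "__main__":
    # Each seed yields a distinct surface form and input values,
    # with the ground-truth answer updated automatically from the code.
    print(sample_instance(seed=8))
```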
Submission Number: 8