ChaosBench-Logic v2: Evaluating LLM Logical Reasoning over Dynamical Systems at Scale
Track: long paper (up to 10 pages)
Keywords: logical reasoning, benchmark, dynamical systems, first-order logic, evaluation protocol, Matthews correlation coefficient, LLM evaluation, scientific reasoning, consistency, MaxSAT, solver-augmented reasoning
TL;DR: A 40,886-question FOL benchmark and a bias-aware evaluation protocol (CARE) reveal that LLMs follow logical rules but fail on parameter-dependent scientific reasoning; MaxSAT repair fixes inconsistency errors but not reasoning errors.
Abstract: Standard accuracy on binary reasoning benchmarks hides critical failure modes: prior collapse, inconsistency under paraphrase, and inability to reason about parameter-dependent dynamics. We present ChaosBench-Logic v2, a 40,886-question benchmark over 165 dynamical systems with 27 FOL predicates and 78 axiom edges, together with CARE (Calibration- and Adversarial-Robust Evaluation), a protocol that surfaces these pathologies. Evaluating 14 models, we find that regime-transition reasoning remains near-random (MCC=0.05) even for frontier models, while FOL deduction with given premises reaches MCC=0.52. A per-family decomposition shows that the proprietary advantage concentrates on cross-indicator ($\Delta$MCC=+0.40) and consistency tasks, while the open-source Qwen 2.5-32B dominates indicator diagnostics (0.91 vs. 0.45). Two models exhibit negative MCC on bifurcation questions, which confusion-matrix analysis confirms as systematic anti-correlation.
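To illustrate why accuracy can hide prior collapse while MCC exposes it, the following minimal sketch computes both from binary confusion-matrix counts using the standard MCC formula. The function names and all counts are hypothetical, chosen only for illustration; they are not taken from the benchmark or from the CARE protocol.

```python
import math


def mcc(tp: int, fp: int, tn: int, fn: int) -> float:
    """Matthews correlation coefficient from binary confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        # Common convention: MCC is reported as 0 when a row or column is empty.
        return 0.0
    return (tp * tn - fp * fn) / denom


def accuracy(tp: int, fp: int, tn: int, fn: int) -> float:
    """Fraction of correct answers."""
    return (tp + tn) / (tp + fp + tn + fn)


# Hypothetical counts (assumptions for illustration only).
# A model that always answers "no" on an 80/20 imbalanced question set:
collapsed = dict(tp=0, fp=0, tn=80, fn=20)
# A model whose answers are systematically inverted on a balanced set:
anti_correlated = dict(tp=10, fp=40, tn=10, fn=40)

for name, counts in [("collapsed", collapsed), ("anti_correlated", anti_correlated)]:
    print(f"{name}: accuracy={accuracy(**counts):.2f}, MCC={mcc(**counts):+.2f}")
```

In this toy setting the collapsed model scores 0.80 accuracy but an MCC of 0, and the inverted model yields a negative MCC (here -0.60), which is the kind of signal the abstract describes as systematic anti-correlation.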
Presenter: ~Noel_Thomas1
Format: Maybe: the presenting author will attend in person, contingent on other factors that still need to be determined (e.g., visa, funding).
Anonymization: This submission has been anonymized for double-blind review via the removal of identifying information such as names, affiliations, and identifying URLs.
Funding: Yes, the presenting author of this submission falls under ICLR’s funding aims, and funding would significantly impact their ability to attend the workshop in person.
Submission Number: 146