ChaosBench-Logic v2: Evaluating LLM Logical Reasoning over Dynamical Systems at Scale

Noel Thomas

Published: 05 Mar 2026, Last Modified: 25 Apr 2026 | ICLR 2026 Workshop LLM Reasoning | Everyone | CC BY 4.0

Track: long paper (up to 10 pages)

Keywords: logical reasoning, benchmark, dynamical systems, first-order logic, evaluation protocol, Matthews correlation coefficient, LLM evaluation, scientific reasoning, consistency, MaxSAT, solver-augmented reasoning

TL;DR: A 40,886-question FOL benchmark and bias-aware evaluation protocol (CARE) reveal that LLMs follow logical rules but fail on parameter-dependent scientific reasoning, with MaxSAT repair fixing inconsistent errors but not reasoning errors.

Abstract: Standard accuracy on binary reasoning benchmarks hides critical failure modes: prior collapse, inconsistency under paraphrase, and inability to reason about parameter-dependent dynamics. We present ChaosBench-Logic v2, a 40,886-question benchmark over 165 dynamical systems with 27 FOL predicates and 78 axiom edges, together with CARE (Calibration- and Adversarial-Robust Evaluation), a protocol that surfaces these pathologies. Evaluating 14 models, we find that regime transition reasoning remains near-random (MCC=0.05) even for frontier models, while FOL deduction with given premises reaches MCC=0.52; per-family decomposition shows the proprietary advantage concentrates on cross-indicator ($\Delta$MCC=+0.40) and consistency tasks, while open-source Qwen 2.5-32B dominates indicator diagnostics (0.91 vs. 0.45). Two models exhibit negative MCC on bifurcation questions, confirmed as systematic anti-correlation via confusion matrix analysis.
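To illustrate why the abstract reports MCC rather than raw accuracy, the following is a minimal sketch on toy data. The labels, predictions, and function names are hypothetical and are not taken from the benchmark or its evaluation code. It assumes the common convention of reporting MCC as 0 when the denominator vanishes.

```python
import math

def confusion(y_true, y_pred):
    """Return (tp, tn, fp, fn) for two equal-length lists of booleans."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    tn = sum(1 for t, p in zip(y_true, y_pred) if not t and not p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    return tp, tn, fp, fn

def accuracy(y_true, y_pred):
    tp, tn, _, _ = confusion(y_true, y_pred)
    return (tp + tn) / len(y_true)

def mcc(y_true, y_pred):
    """Matthews correlation coefficient for binary labels."""
    tp, tn, fp, fn = confusion(y_true, y_pred)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Assumed convention: a degenerate confusion matrix yields MCC = 0.
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom

# Hypothetical imbalanced answer key: 80 TRUE and 20 FALSE questions.
labels = [True] * 80 + [False] * 20

# "Prior collapse": a model that answers TRUE to every question.
always_true = [True] * 100

# Systematic anti-correlation: every answer is the opposite of the label.
flipped = [not t for t in labels]

print(accuracy(labels, always_true), mcc(labels, always_true))
print(accuracy(labels, flipped), mcc(labels, flipped))
```

On this toy data the always-TRUE model scores 80% accuracy but an MCC of 0, while the flipped model has an MCC of -1, the kind of negative value the abstract describes as systematic anti-correlation.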

Anonymization: This submission has been anonymized for double-blind review via the removal of identifying information such as names, affiliations, and identifying URLs.

Funding: Yes, the presenting author of this submission falls under ICLR’s funding aims, and funding would significantly impact their ability to attend the workshop in person.

Submission Number: 146
