Keywords: reasoning, logic, deductive reasoning, AI reasoning benchmarks, symbolization, countermodel construction, validity assessment, model-theoretic reasoning, logic benchmarks, large language models, solver-verified evaluation, first-order logic, controlled natural language
Abstract: Large language models have demonstrated notable performance across various logical reasoning benchmarks. However, it remains unclear which core logical skills they truly master. To address this, we introduce LogicSkills, a unified benchmark designed to isolate three fundamental skills in formal reasoning: (i) formal symbolization—translating premises into first-order logic; (ii) countermodel construction—formulating a finite structure in which all premises are true while the conclusion is false; and (iii) validity assessment—deciding whether a conclusion follows from a given set of premises. Items are drawn from the two-variable fragment of first-order logic (without identity) and are presented in both natural English and a Carroll-style language with nonce words. All examples are verified for correctness and non-triviality using the SMT solver Z3. Across leading models, performance is high on validity assessment but substantially lower on symbolization and countermodel construction, suggesting reliance on surface-level patterns rather than genuine symbolic or rule-based reasoning.
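The solver-verified setup described in the abstract lends itself to a compact illustration. The sketch below, in Python with the z3 bindings, shows the standard reduction: an argument is valid iff the premises conjoined with the negated conclusion are unsatisfiable, and any satisfying model of that conjunction is a countermodel. The predicate names (Dog, Mammal, rex) are illustrative only; the abstract does not disclose the benchmark's actual verification code.

```python
from z3 import (Solver, DeclareSort, Function, BoolSort, Const,
                ForAll, Implies, Not, unsat, sat)

# Illustrative argument (not a benchmark item):
#   P1: forall x (Dog(x) -> Mammal(x))
#   P2: Dog(rex)
#   C:  Mammal(rex)
U = DeclareSort("U")                      # domain of discourse
Dog = Function("Dog", U, BoolSort())      # unary predicates
Mammal = Function("Mammal", U, BoolSort())
rex = Const("rex", U)
x = Const("x", U)

premises = [ForAll([x], Implies(Dog(x), Mammal(x))), Dog(rex)]
conclusion = Mammal(rex)

# Validity check: premises AND not(conclusion) unsatisfiable => valid.
s = Solver()
s.add(*premises)
s.add(Not(conclusion))
result = s.check()
if result == unsat:
    print("valid: no countermodel exists")
elif result == sat:
    # A satisfying assignment is a countermodel: all premises
    # hold in it while the conclusion is false.
    print("invalid; countermodel:", s.model())
```

Restricting items to the two-variable fragment keeps satisfiability decidable, which is presumably what lets the benchmark guarantee solver-verified correctness and non-triviality for every example.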
Paper Type: Short
Research Area: Mathematical, Symbolic, Neurosymbolic, and Logical Reasoning
Research Area Keywords: benchmarking, logical reasoning, evaluation methodologies, NLP datasets, reasoning, automatic creation and evaluation of language resources
Contribution Types: Model analysis & interpretability, Publicly available software and/or pre-trained models, Data resources, Data analysis
Languages Studied: English, Carrollian (constructed nonce-word language)
Submission Number: 3917