Keywords: reasoning, logic, deductive reasoning, AI reasoning benchmarks, symbolization, countermodel construction, validity assessment, model-theoretic reasoning, logic benchmarks, large language models, solver-verified evaluation, first-order logic, controlled natural language
Abstract: Large language models have demonstrated notable performance across various logical reasoning benchmarks. However, it remains unclear which core logical skills they truly master. To address this, we introduce LogicSkills, a unified benchmark designed to isolate three fundamental skills in formal reasoning: (i) formal symbolization—translating premises into first-order logic; (ii) countermodel construction—formulating a finite structure in which all premises are true while the conclusion is false; and (iii) validity assessment—deciding whether a conclusion follows from a given set of premises. Items are drawn from the two-variable fragment of first-order logic (without identity) and are presented in both natural English and a Carroll-style language with nonce words. All examples are verified for correctness and non-triviality using the SMT solver Z3. Across leading models, performance is high on validity assessment but substantially lower on symbolization and countermodel construction, suggesting reliance on surface-level patterns rather than genuine symbolic or rule-based reasoning.
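The solver-verified setup described in the abstract lends itself to a compact illustration. The sketch below, in Python with the z3 bindings, shows the standard reduction: an argument is valid iff the premises conjoined with the negated conclusion are unsatisfiable, and any satisfying model of that conjunction is a countermodel. The predicate names (Dog, Mammal, rex) are illustrative only; the abstract does not disclose the benchmark's actual verification code.

```python
from z3 import (Solver, DeclareSort, Function, BoolSort, Const,
                ForAll, Implies, Not, unsat, sat)

# Illustrative argument (not a benchmark item):
#   P1: forall x (Dog(x) -> Mammal(x))
#   P2: Dog(rex)
#   C:  Mammal(rex)
U = DeclareSort("U")                      # domain of discourse
Dog = Function("Dog", U, BoolSort())      # unary predicates
Mammal = Function("Mammal", U, BoolSort())
rex = Const("rex", U)
x = Const("x", U)

premises = [ForAll([x], Implies(Dog(x), Mammal(x))), Dog(rex)]
conclusion = Mammal(rex)

# Validity check: premises AND not(conclusion) unsatisfiable => valid.
s = Solver()
s.add(*premises)
s.add(Not(conclusion))
result = s.check()
if result == unsat:
    print("valid: no countermodel exists")
elif result == sat:
    # A satisfying assignment is a countermodel: all premises
    # hold in it while the conclusion is false.
    print("invalid; countermodel:", s.model())
```

Restricting items to the two-variable fragment keeps satisfiability decidable, which is presumably what lets the benchmark guarantee solver-verified correctness and non-triviality for every example.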
Paper Type: Short
Research Area: Mathematical, Symbolic, Neurosymbolic, and Logical Reasoning
Research Area Keywords: benchmarking, logical reasoning, evaluation methodologies, NLP datasets, reasoning, automatic creation and evaluation of language resources
Contribution Types: Model analysis & interpretability, Publicly available software and/or pre-trained models, Data resources, Data analysis
Languages Studied: English, Carrollian (constructed nonce-word language)
Submission Number: 3917