Track: long paper (up to 10 pages)
Keywords: Modal Logic, Deontic Logic, Kripke Semantics, LLM Benchmarking, Logical Reasoning, Axiomatic Bias, Formal-Natural Language Gap.
TL;DR: ModalBench is the first Kripke-grounded benchmark evaluating modal and deontic logic in LLMs. Results reveal fragile axiom reasoning, implicit S5 biases, and asymmetric formal-to-natural language gaps.
Abstract: Large language models excel at propositional and first-order logic, yet no benchmark tests whether they can reason about necessity, possibility, obligation, or permission—the domain of modal and deontic logic. We introduce ModalBench, the first benchmark grounded in Kripke semantics for evaluating modal reasoning in LLMs. ModalBench consists of 1,500 problems spanning five modal systems (K, T, S4, S5, and deontic D), three difficulty tiers, and two presentation tracks (formal symbolic and natural language), each with computationally verified ground truth. We evaluate three LLMs with three prompting strategies (27,000 total inferences) and find: (1) under chain-of-thought, models achieve 73–91% accuracy, with system K (unconstrained accessibility) as the hardest at 65–75%; (2) LLMs reason about modality better in natural language than in formal notation—two of three models score significantly higher on narrative presentations, with Qwen3-235B showing an 18.4 pp advantage for natural language (p < 10⁻⁹⁹); (3) world-enumeration prompting acts as a scaffold: it boosts weaker models by 9–11 pp but is unnecessary for the strongest model, which peaks at 95.9% with zero-shot prompting; (4) all models exhibit implicit S5 bias, over-applying the Euclidean axiom by up to +56 pp in system K where it is not valid. We release ModalBench, the evaluation code, and all results at https://github.com/mujtabahasan/modalbench.
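To illustrate what "Kripke-grounded, computationally verified ground truth" means in practice (this is a minimal sketch, not the authors' released code), a modal formula can be evaluated at a world of a finite Kripke model, and the S5 bias described in finding (4) corresponds to accepting the Euclidean axiom ◇p → □◇p even on frames where a countermodel exists. The tuple-based formula encoding and helper names below are hypothetical:

```python
# Minimal Kripke-model evaluator (illustrative sketch, assumed encoding:
# formulas as nested tuples, e.g. ("box", ("atom", "p"))).

def holds(model, world, formula):
    """Evaluate `formula` at `world` in model = (worlds, R, V),
    where R maps each world to its accessible worlds and
    V maps each world to the set of atoms true there."""
    worlds, R, V = model
    op = formula[0]
    if op == "atom":
        return formula[1] in V[world]
    if op == "not":
        return not holds(model, world, formula[1])
    if op == "and":
        return holds(model, world, formula[1]) and holds(model, world, formula[2])
    if op == "box":   # necessity: true in every accessible world
        return all(holds(model, v, formula[1]) for v in R[world])
    if op == "dia":   # possibility: true in some accessible world
        return any(holds(model, v, formula[1]) for v in R[world])
    raise ValueError(f"unknown operator: {op}")

# Countermodel in system K to the Euclidean axiom ◇p → □◇p
# (valid in S5, where accessibility is an equivalence relation,
# but not in K, where accessibility is unconstrained):
worlds = {"w", "u"}
R = {"w": ["u"], "u": []}        # u sees nothing, so R is not Euclidean
V = {"w": set(), "u": {"p"}}
model = (worlds, R, V)

dia_p = ("dia", ("atom", "p"))
box_dia_p = ("box", dia_p)
print(holds(model, "w", dia_p))      # True: p is possible at w
print(holds(model, "w", box_dia_p))  # False: ◇p fails at u, so □◇p fails at w
```

A model that answers "valid" for ◇p → □◇p on such a K-frame problem is exhibiting exactly the implicit S5 bias the benchmark measures.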
Anonymization: This submission has been anonymized for double-blind review via the removal of identifying information such as names, affiliations, and identifying URLs.
Submission Number: 212