JustLogic: A Comprehensive Benchmark for Evaluating Deductive Reasoning in LLMs

09 May 2025 (modified: 30 Oct 2025); Submitted to NeurIPS 2025 Datasets and Benchmarks Track; CC BY 4.0
Keywords: benchmark, logical reasoning, LLM, natural language processing (NLP), propositional logic
TL;DR: We present JustLogic, a benchmark for measuring the deductive reasoning capabilities of LLMs that is more challenging, reliable, and insightful than existing benchmarks.
Abstract: Logical reasoning is a critical capability of Large Language Models (LLMs), and substantial research efforts in recent years have aimed to enhance their deductive reasoning abilities. However, existing deductive reasoning benchmarks, which are crucial for evaluating and advancing LLMs, suffer from fundamental limitations that severely restrict their utility: a lack of task complexity, the presence of prior knowledge as a confounder, and superficial error analysis. To address these deficiencies, we introduce JustLogic, a synthetically generated benchmark designed for rigorous evaluation of LLMs. JustLogic is (i) highly complex, capable of generating a diverse range of linguistic patterns, vocabulary, and argument structures; (ii) prior-knowledge independent, eliminating the advantage of models possessing prior knowledge and ensuring that only deductive reasoning is used to answer questions; and (iii) capable of in-depth error analysis on the heterogeneous effects of reasoning depth and argument form on model accuracy. Our experimental results on JustLogic reveal that (i) state-of-the-art (SOTA) reasoning LLMs perform on par with or better than the human average but significantly worse than the human ceiling, and (ii) SOTA non-reasoning models still underperform the human average. All code and data are available at https://github.com/michaelchen-lab/JustLogic
Croissant File: json
Dataset URL: https://huggingface.co/datasets/michaelchenkj/JustLogic
Code URL: https://github.com/michaelchen-lab/JustLogic
Primary Area: Datasets & Benchmarks for applications in language modeling and vision language modeling
Submission Number: 876