Keywords: benchmark, logical reasoning, LLM, natural language processing (NLP), propositional logic
TL;DR: We present JustLogic, a benchmark for measuring the deductive reasoning capabilities of LLMs that is more challenging, reliable, and insightful than existing benchmarks.
Abstract: Logical reasoning is a critical capability of Large Language Models (LLMs), and substantial research efforts in recent years have aimed to enhance their deductive reasoning. However, existing deductive reasoning benchmarks, which are crucial for evaluating and advancing LLMs, are inadequate due to their lack of task complexity, the presence of prior knowledge as a confounder, and superficial error analysis. To address these deficiencies, we introduce JustLogic, a synthetically generated deductive reasoning benchmark designed for rigorous evaluation of LLMs. JustLogic is (i) highly complex, capable of generating a diverse range of linguistic patterns, vocabulary, and argument structures; (ii) context-independent, eliminating the advantage held by models with prior knowledge and ensuring that only deductive reasoning can be used to answer questions; and (iii) capable of supporting in-depth error analysis of the heterogeneous effects of reasoning depth and argument form on model accuracy. Our experimental results on JustLogic reveal that the performance of most state-of-the-art (SOTA) LLMs, specifically Llama3-8B (57.8\%), Llama3-70B (64.6\%), and GPT-4o (65.6\%), is significantly worse than average human performance (73.0\%). A recently released reasoning model, OpenAI o1-preview, performed substantially better, with an accuracy of 81.0\%, but it still lags behind the human ceiling of 100.0\%. These results demonstrate that the JustLogic benchmark is realistic and achievable for both humans and models, and that there is still substantial room for improvement in the deductive reasoning capabilities of LLMs. We posit that the use of context-dependent and relatively simplistic benchmarks has misrepresented the reasoning abilities of many SOTA models. We release JustLogic as an open-source dataset to provide accurate evaluations of model performance in deductive reasoning and to facilitate LLM advancement through in-depth error analysis.
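To illustrate the kind of context-independent item the abstract describes, below is a minimal sketch in Python of how a propositional-logic question with fictional subjects and predicates might be synthesized so that prior knowledge cannot help. This is a hypothetical illustration, not the authors' actual generation pipeline; the word lists and the make_modus_ponens_item helper are invented for this example.

    import random

    # Hypothetical sketch: synthesize a context-independent modus ponens item.
    # Nonsense subjects and predicates ensure the answer cannot be recalled
    # from prior knowledge; only the stated premises support the conclusion.
    SUBJECTS = ["wugs", "blickets", "daxes"]
    PREDICATES = ["zorp", "flim", "quell"]

    def make_modus_ponens_item(rng: random.Random) -> dict:
        s = rng.choice(SUBJECTS)
        p, q = rng.sample(PREDICATES, 2)
        premises = [
            f"If {s} {p}, then {s} {q}.",  # P -> Q
            f"{s.capitalize()} {p}.",      # P
        ]
        conclusion = f"{s.capitalize()} {q}."  # Q follows by modus ponens
        return {"premises": premises, "question": conclusion, "answer": "True"}

    if __name__ == "__main__":
        print(make_modus_ponens_item(random.Random(0)))

Deeper reasoning depths could, in principle, be produced by chaining several such inference steps, and different argument forms (e.g., modus tollens, disjunctive syllogism) by swapping the premise templates.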
Supplementary Material: zip
Primary Area: datasets and benchmarks
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 6386