Evaluating LLMs on Syllogistic Reasoning: A Human–Model Accuracy Comparison under Premise Order Effects
Keywords: LLM, Benchmark, Syllogistic Reasoning, Human Comparison
TL;DR: This study shows that although LLMs achieve higher accuracy than humans on syllogistic reasoning tasks, they lack logical consistency, with premise-order effects that fluctuate unpredictably across models.
Abstract: Large Language Models (LLMs) have achieved remarkable success on many natural language tasks; however, their performance on logical reasoning remains unsatisfactory. This study evaluates LLMs’ sensitivity to premise order in logical reasoning within the classic paradigm of categorical syllogisms, and benchmarks their accuracy against a human baseline. We constructed a test set of 64 natural-language syllogisms with a dual-order design. Human participants (N=1317) each completed a random subset of 32 items (16 forward-order, 16 reverse-order), while 12 LLMs completed all 64 items. Using accuracy as the sole metric, we defined the order effect as $\Delta acc = Acc_{\text{forward}} - Acc_{\text{reverse}}$ and conducted statistical analyses at the overall, per-figure, and per-form levels across the 32 logical forms. The results show that while the human group exhibits no overall order effect, LLMs as a whole display a weak, non-systematic effect whose direction is inconsistent across models. Moreover, the models proved most fragile on the logical forms that were challenging for both humans and machines. The human–LLM correlation of $\Delta acc$ across the 32 logical forms is nearly zero, and the directional agreement rate is not significantly higher than chance. Our work provides a new conceptual framework and an empirical benchmark for investigating intrinsic limitations of LLM reasoning.
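As a rough illustration of the analysis the abstract describes, the Python sketch below computes the per-form order effect $\Delta acc$, the human–LLM correlation of $\Delta acc$ across the 32 logical forms, and the directional agreement rate against the 50% chance level. The accuracy arrays are random placeholders, and the specific tests (Pearson correlation, exact binomial test) are assumptions; the submission does not specify its statistical procedures here.

```python
import numpy as np
from scipy.stats import pearsonr, binomtest

# Hypothetical per-form accuracies: rows = 32 logical forms,
# columns = (forward-order accuracy, reverse-order accuracy).
rng = np.random.default_rng(0)
human = rng.uniform(0.3, 0.9, size=(32, 2))  # placeholder human accuracies
llm = rng.uniform(0.3, 0.9, size=(32, 2))    # placeholder LLM accuracies

# Order effect per logical form: Delta_acc = Acc_forward - Acc_reverse.
delta_human = human[:, 0] - human[:, 1]
delta_llm = llm[:, 0] - llm[:, 1]

# Human-LLM correlation of Delta_acc across the 32 logical forms
# (Pearson is an assumed choice of correlation measure).
r, p = pearsonr(delta_human, delta_llm)
print(f"Pearson r = {r:.3f} (p = {p:.3f})")

# Directional agreement rate: fraction of forms where the human and
# LLM order effects share a sign, tested against the 50% chance level
# with an exact binomial test (also an assumed choice).
agree = int(np.sum(np.sign(delta_human) == np.sign(delta_llm)))
test = binomtest(agree, n=32, p=0.5)
print(f"Directional agreement: {agree}/32 (binomial p = {test.pvalue:.3f})")
```

A near-zero $r$ together with an agreement rate indistinguishable from 16/32 would correspond to the paper's reported finding that human and LLM order effects are unrelated.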
Submission Number: 40