Consistent Biases in Large Language Models' Syllogistic Reasoning

Published: 15 Nov 2025, Last Modified: 08 Mar 2026
Venue: AAAI 2026 Bridge LMReasoning
License: CC BY 4.0
Keywords: Syllogistic reasoning, Large language models (LLMs), Logical reasoning
Abstract: Large language models (LLMs) have demonstrated impressive performance across a wide range of language understanding tasks. However, whether they can truly reason logically over natural language remains an open question. While prior studies have evaluated LLMs on various formal logic benchmarks, their performance on categorical syllogisms has not been systematically mapped, particularly with regard to the interplay between logical structure and linguistic heuristics. To address this gap, we present a systematic evaluation of LLMs on categorical syllogistic reasoning, covering all four Figures and both valid and invalid Moods. We assess five representative models: GPT-4o, Gemini-2.0-Flash, LLaMA-3.3-70B, Qwen-3-Max, and DeepSeek-Chat. Our results reveal that LLMs exhibit limited formal reasoning ability and perform particularly poorly on invalid syllogisms, where linguistic plausibility conflicts with logical validity. Moreover, all models show a consistent bias toward the syntactic position of the middle term, suggesting that their reasoning relies on surface linguistic cues rather than abstract logical structures. We hope this work provides a foundation for more rigorous evaluation and improvement of logical reasoning in future language models.
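For readers unfamiliar with the terminology, the space of categorical syllogisms mentioned in the abstract can be sketched as follows. A Mood is a triple of statement types (A, E, I, O) for the major premise, minor premise, and conclusion, and a Figure fixes where the middle term M sits in the two premises, giving 64 x 4 = 256 forms. The Python sketch below is illustrative only and is not the authors' evaluation code: the function names (`holds`, `worlds`, `valid_forms`) are hypothetical, and since the abstract does not say which validity convention is used, the sketch supports both the modern reading and the traditional reading that assumes every term is non-empty.

```python
from itertools import product

TERMS = ("S", "M", "P")  # minor term, middle term, major term

# Each "kind" of individual records its membership in S, M, and P (8 kinds).
KINDS = list(product((False, True), repeat=3))

# Figure -> (major premise terms, minor premise terms), as (subject, predicate).
# The conclusion is always S-P; only the position of M changes.
FIGURES = {
    1: (("M", "P"), ("S", "M")),
    2: (("P", "M"), ("S", "M")),
    3: (("M", "P"), ("M", "S")),
    4: (("P", "M"), ("M", "S")),
}


def holds(form, subj, pred, world):
    """Truth of an A/E/I/O statement in a world (a list of inhabited kinds)."""
    i, j = TERMS.index(subj), TERMS.index(pred)
    both = any(k[i] and k[j] for k in world)           # some subj is pred
    subj_only = any(k[i] and not k[j] for k in world)  # some subj is not pred
    return {"A": not subj_only, "E": not both, "I": both, "O": subj_only}[form]


def worlds(existential_import):
    """Yield every world; optionally require S, M, and P to be non-empty."""
    for mask in product((False, True), repeat=len(KINDS)):
        world = [k for k, keep in zip(KINDS, mask) if keep]
        if existential_import and not all(
            any(k[i] for k in world) for i in range(3)
        ):
            continue
        yield world


def valid_forms(existential_import=False):
    """Return the forms (e.g. 'AAA-1') whose conclusion holds in every
    world where both premises hold."""
    all_worlds = list(worlds(existential_import))
    valid = []
    for figure, (major, minor) in FIGURES.items():
        for mood in product("AEIO", repeat=3):
            if all(
                holds(mood[2], "S", "P", w)
                for w in all_worlds
                if holds(mood[0], *major, w) and holds(mood[1], *minor, w)
            ):
                valid.append("".join(mood) + "-" + str(figure))
    return valid


if __name__ == "__main__":
    print(len(valid_forms()))      # modern reading
    print(len(valid_forms(True)))  # traditional reading (non-empty terms)
```

Under the modern reading this check accepts the 15 textbook forms (for example AAA-1, "Barbara"), and under the non-empty-terms assumption it accepts 24. All remaining forms are invalid, which is the category the abstract reports as hardest for the models.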
Submission Number: 19