Benchmarking Logical Reasoning Inconsistencies in Local Large Language Models: Evidence from Multi-Domain Evaluation
Track: tiny / short paper (up to 4 pages)
Keywords: Large Language Models (LLMs), Logical Consistency, Reasoning Evaluation, Compositional Generalization, Quantization Effects, Multi-Domain Benchmarking
TL;DR: Local LLMs show weak logical reasoning (48–60%) despite strong ethics performance (84–92%). We find consistency failures, contrapositive errors, and transitivity violations, suggesting reliance on pattern matching over true logical inference.
Abstract: We present systematic evidence of logical reasoning limitations in local large language models through MREB (Multimodal Reasoning and Ethics Benchmark), focusing on deduction, induction, and consistency across related questions. Our evaluation of four prominent local models reveals significant logical reasoning deficits, with accuracy ranging from 48% to 60% on logical tasks, compared with 84–92% on ethics questions that require similar reasoning patterns. We identify three critical failure modes: (1) inconsistent logical deduction across semantically equivalent problems, (2) failure to maintain logical consistency when reasoning about related scenarios, and (3) systematic bias toward pattern matching over genuine logical inference. Our findings demonstrate that current local LLMs exhibit fundamental logical reasoning limitations that are masked by strong performance in other cognitive domains, highlighting the need for targeted logical reasoning improvements and more rigorous consistency evaluation frameworks.
Presenter: ~Tadisetty_Sai_Yashwanth1
Format: Maybe: the presenting author will attend in person, contingent on factors still to be determined (e.g., visa, funding).
Anonymization: This submission has been anonymized for double-blind review via the removal of identifying information such as names, affiliations, and identifying URLs.
Funding: Yes, the presenting author of this submission falls under ICLR’s funding aims, and funding would significantly impact their ability to attend the workshop in person.
Submission Number: 104