Keywords: Logical Reasoning, Knights and Knaves, Abductive Reasoning, Deductive Reasoning, Large Language Models (LLMs), Chain-of-Thought (CoT), Multi-Agent Debate (MAD), LLM Judgement
Abstract: We introduce VERDICT (Verifiable Evaluation of Reasoning In Complex Truth-puzzles), a logical reasoning benchmark designed to rigorously evaluate the deductive capabilities of LLMs. VERDICT adds higher-dimensional complexity through a 999-puzzle suite featuring six distinct character types across three difficulty tiers. We evaluate six LLMs under three prompting strategies: One-Shot, Chain-of-Thought (CoT), and Multi-Agent Debate (MAD). Our results reveal a critical "Capability Threshold": while MAD significantly boosts accuracy for capable models such as Gemini-2.5-Pro, it fails completely for weaker models, inducing "hallucination loops" in which agents reinforce each other's errors. These findings suggest that while multi-agent architectures can act as "reasoning multipliers," they are not a universal remedy for fundamental reasoning deficits.
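To make the underlying puzzle family concrete, here is a minimal sketch of a brute-force Knights-and-Knaves solver in Python. The example puzzle, the `solve` helper, and the statement encoding are illustrative assumptions of ours, not drawn from the VERDICT benchmark or the authors' code.

```python
from itertools import product

def solve(statements):
    """Brute-force a Knights-and-Knaves puzzle.

    statements: list of functions, one per character, each mapping an
    assignment tuple (True = knight, False = knave) to the truth value
    of that character's claim. Knights always tell the truth and knaves
    always lie, so a claim must be true iff its speaker is a knight.
    Returns all assignments consistent with every statement.
    """
    n = len(statements)
    solutions = []
    for assignment in product([True, False], repeat=n):
        if all(stmt(assignment) == is_knight
               for stmt, is_knight in zip(statements, assignment)):
            solutions.append(assignment)
    return solutions

# Illustrative two-character puzzle (not from the benchmark):
# A says "B is a knave"; B says "A and I are the same kind."
statements = [
    lambda a: not a[1],      # A's claim: B is a knave
    lambda a: a[0] == a[1],  # B's claim: A and B are the same type
]
print(solve(statements))  # [(True, False)]: A is a knight, B is a knave
```

Exhaustive search over 2^n assignments is tractable for small casts; benchmarks of this kind typically scale difficulty by increasing the number of characters and the logical depth of their statements.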
Submission Number: 93