Keywords: Logical Reasoning, Knights and Knaves, Abductive Reasoning, Deductive Reasoning, Large Language Models (LLMs), Chain-of-Thought (CoT), Multi-Agent Debate (MAD), LLM Judgement
Abstract: We introduce VERDICT (Verifiable Evaluation of Reasoning In Complex Truth-puzzles), a logical reasoning benchmark designed to rigorously evaluate the deductive capabilities of LLMs. VERDICT adds higher-dimensional complexity through a 999-puzzle suite featuring six distinct character types across three difficulty tiers. We evaluate six LLMs under three prompting strategies: One-Shot, Chain-of-Thought (CoT), and Multi-Agent Debate (MAD). Our results reveal a critical "Capability Threshold": while MAD significantly boosts accuracy for capable models such as Gemini-2.5-Pro, it fails completely for weaker models, inducing "hallucination loops" in which agents reinforce each other's errors. These findings suggest that while multi-agent architectures can act as "reasoning multipliers," they are not a universal remedy for fundamental reasoning deficits.
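To make the underlying puzzle family concrete, here is a minimal sketch of a brute-force Knights-and-Knaves solver in Python. The example puzzle, the `solve` helper, and the statement encoding are illustrative assumptions of ours, not drawn from the VERDICT benchmark or the authors' code.

```python
from itertools import product

def solve(statements):
    """Brute-force a Knights-and-Knaves puzzle.

    statements: list of functions, one per character, each mapping an
    assignment tuple (True = knight, False = knave) to the truth value
    of that character's claim. Knights always tell the truth and knaves
    always lie, so a claim must be true iff its speaker is a knight.
    Returns all assignments consistent with every statement.
    """
    n = len(statements)
    solutions = []
    for assignment in product([True, False], repeat=n):
        if all(stmt(assignment) == is_knight
               for stmt, is_knight in zip(statements, assignment)):
            solutions.append(assignment)
    return solutions

# Illustrative two-character puzzle (not from the benchmark):
# A says "B is a knave"; B says "A and I are the same kind."
statements = [
    lambda a: not a[1],      # A's claim: B is a knave
    lambda a: a[0] == a[1],  # B's claim: A and B are the same type
]
print(solve(statements))  # [(True, False)]: A is a knight, B is a knave
```

Exhaustive search over 2^n assignments is tractable for small casts; benchmarks of this kind typically scale difficulty by increasing the number of characters and the logical depth of their statements.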
Submission Number: 93