When Verification Hurts: Asymmetric Effects of Multi-Agent Feedback in Logic Proof Tutoring

ACL ARR 2026 January Submission 8168 Authors

06 Jan 2026 (modified: 20 Mar 2026), ACL ARR 2026 January Submission, CC BY 4.0

Keywords: Multi-agent systems, Educational AI, Logic tutoring, LLM verification, Step-level feedback, Intelligent tutoring systems

Abstract: Large language models (LLMs) are increasingly used for automated tutoring, but their reliability in structured symbolic domains remains unclear. We study step-level feedback for propositional logic proofs, which require precise symbolic reasoning aligned with a learner's current proof state. We introduce a knowledge-graph–grounded benchmark of 516 unique proof states with step-level annotations and difficulty metrics. Unlike prior tutoring evaluations that rely on model self-assessment or binary correctness, our framework enables fine-grained analysis of feedback quality against verified solution paths. We evaluate three multi-agent pipelines with varying solution access: Tutor (partial solution access), Teacher (full derivation access), and Judge (verification of Tutor feedback). Our results reveal a striking asymmetry: verification improves outcomes when upstream feedback is error-prone (<70% accuracy), but degrades performance by 4–6 percentage points through over-specification when feedback is already reliable (>85% accuracy). Critically, we identify a shared complexity ceiling: no model or pipeline reliably succeeds on proof states exceeding complexity 4–5. These findings challenge the assumption that adding verifiers or richer context universally improves tutoring, motivating adaptive, difficulty-aware architectures that route problems by estimated complexity and upstream reliability.
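
As an illustration only, the routing idea in the abstract's last sentence could be sketched as below. This is not the authors' implementation. The thresholds (70%, 85%, and a ceiling near complexity 4–5) come from the abstract, while the `ProofState` fields, the `route` function, the route labels, and the choice of 4 as the cutoff are hypothetical. The abstract does not say what happens between 70% and 85% accuracy, so the sketch flags that band instead of guessing.

```python
from dataclasses import dataclass

# Thresholds quoted in the abstract; how to act on them is an assumption.
LOW_RELIABILITY = 0.70    # below this, verification improved outcomes
HIGH_RELIABILITY = 0.85   # above this, verification cost 4-6 points
COMPLEXITY_CEILING = 4    # assumed cutoff; the abstract reports a ceiling at 4-5


@dataclass
class ProofState:
    """Hypothetical container for the two routing signals."""
    complexity: float         # estimated difficulty of the proof state
    upstream_accuracy: float  # estimated Tutor feedback accuracy in [0, 1]


def route(state: ProofState) -> str:
    """Pick a pipeline label for one proof state (illustrative policy)."""
    if state.complexity > COMPLEXITY_CEILING:
        # No model or pipeline was reliable here, so do not trust automation.
        return "escalate"
    if state.upstream_accuracy < LOW_RELIABILITY:
        # Error-prone upstream feedback: add the Judge verification step.
        return "tutor+judge"
    if state.upstream_accuracy > HIGH_RELIABILITY:
        # Reliable upstream feedback: skip verification to avoid over-specification.
        return "tutor"
    # The abstract does not characterize this middle band.
    return "needs_calibration"


if __name__ == "__main__":
    for s in (ProofState(2, 0.60), ProofState(3, 0.90), ProofState(6, 0.90)):
        print(s, "->", route(s))
```

The complexity check comes first because the reported ceiling holds regardless of pipeline, so upstream reliability only matters below it.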

Paper Type: Long

Research Area: AI/LLM Agents

Research Area Keywords: LLM agents, multi-agent systems, agent communication, agent evaluation, grounded agents

Contribution Types: Model analysis & interpretability, NLP engineering experiment, Data resources, Data analysis

Languages Studied: English

Submission Number: 8168
