When Wrong Answers Look Right: Multi-Agent Debate for High-Precision Verification of Hard Competition Mathematics
Keywords: multi-agent debate, mathematical verification, LLM reasoning, benchmark construction, answer verification
Abstract: High-quality mathematical reasoning data are central to both training and evaluation, but such data are useful only when their answers and reasoning traces are reliable. Candidate solutions may be web-scraped, human-written, adapted from existing competition problems, or generated by LLMs; in all cases, a fluent but subtly wrong solution can silently contaminate downstream use. We present a high-precision multi-agent debate pipeline for blind verification of competition-level mathematics. Five heterogeneous agents independently analyze a candidate answer, exchange structured arguments for up to five rounds, and reach a verdict governed by an assessment-gating criterion that requires positive confirming evidence rather than mere non-refutation. On 195 hard-to-verify variants from 58 IMO/USAMO/Putnam-level problems, a single-judge baseline achieves 55.1\% precision; our pipeline achieves **92.5\%** precision (an 8.3$\times$ false-positive reduction) with 59.8\% problem-level accuracy. A 2$\times$2 factorial ablation shows that, after accounting for repeated verification, most of the remaining false-positive reduction comes from two architectural mechanisms: **assessment gating** and **heterogeneous agent roles**. These results establish precision-first debate as a practical filtering primitive for mathematical reasoning data, with applications to benchmark curation, reward labeling, RLAIF, and self-improving synthetic-data pipelines.
Submission Number: 202
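The assessment-gating idea in the abstract (accept only on positive confirming evidence, not on mere non-refutation) can be illustrated with a minimal sketch. The abstract does not specify the actual criterion, so everything here is an assumption made for illustration: the `Stance` labels, the `Assessment` record, the `gated_verdict` function, and the `min_confirmations` threshold of 3 out of 5 agents are all hypothetical names and values, not the paper's implementation.

```python
from dataclasses import dataclass
from enum import Enum


class Stance(Enum):
    # Hypothetical labels for an agent's final position after debate.
    CONFIRM = "confirm"  # agent produced positive evidence for the answer
    REFUTE = "refute"    # agent found a concrete error
    UNSURE = "unsure"    # agent neither confirmed nor refuted


@dataclass
class Assessment:
    agent: str
    stance: Stance


def gated_verdict(assessments, min_confirmations=3):
    """Illustrative gating rule (assumed, not taken from the paper).

    Accept only when enough agents positively confirm and none refute.
    A panel that merely fails to refute yields "abstain", not "accept".
    """
    confirmations = sum(a.stance is Stance.CONFIRM for a in assessments)
    refutations = sum(a.stance is Stance.REFUTE for a in assessments)
    if refutations > 0:
        return "reject"
    if confirmations >= min_confirmations:
        return "accept"
    return "abstain"


# Five agents, none of whom found an error, but none confirmed either.
silent_panel = [Assessment(f"agent{i}", Stance.UNSURE) for i in range(5)]
silent_verdict = gated_verdict(silent_panel)

# Three confirmations and no refutation pass the gate.
confirming_panel = [
    Assessment("agent0", Stance.CONFIRM),
    Assessment("agent1", Stance.CONFIRM),
    Assessment("agent2", Stance.CONFIRM),
    Assessment("agent3", Stance.UNSURE),
    Assessment("agent4", Stance.UNSURE),
]
confirming_verdict = gated_verdict(confirming_panel)
```

Under this assumed rule, a panel in which no agent objects is still not accepted unless enough agents affirmatively confirm, which is the distinction between gating and non-refutation that the abstract draws. Such a rule trades recall for precision: uncertain cases are withheld rather than passed through.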