DebateSim: An Architectural and Empirical Study of CoT Drift in Multi-Agent Debate Systems

14 Sept 2025 (modified: 17 Sept 2025) | Agents4Science 2025 Conference | Desk Rejected Submission | CC BY 4.0
Keywords: multi-agent debate systems, chain-of-thought (CoT), CoT drift, legislative analysis, democratic discourse, AI judge, context persistence, citation grounding, Congress.gov retrieval, process-level evaluation, drift analysis, reproducibility
TL;DR: DebateSim is a multi-agent system for legislative debates that tracks how arguments evolve across rounds, showing that structured prompts and context memory make AI debates more factual, consistent, and easier to follow.
Abstract: Democratic discourse increasingly unfolds across digital venues where citizens face three compounding obstacles: (i) legislative texts are long, technical, and cross-reference complex statutory regimes that are hard to parse without training \citep{Kornilova2019BillSum,LegalBench2023}, (ii) online debate often privileges speed, virality, and polarization over structured, evidence-grounded argumentation \citep{Allen2020FakeNews,Bail2020PrismBook}, and (iii) access barriers persist for non-experts who lack tools to interrogate policy at scale \citep{Wang2023AIpolicyReview}. Large language models (LLMs) can help summarize, critique, and reason over policy \citep{Zhang2024CollaborativeSynthesis,Johnson2023AutomatedLegis}, but single-agent pipelines struggle with multi-perspective synthesis, adversarial engagement, and longitudinal consistency \citep{Irving2018AISafetyDebate,Li2023CAMEL}. We present \textbf{DebateSim}, a multi-agent architecture for legislative analysis and structured debate generation. DebateSim integrates role-specialized agents (Pro/Con debaters, AI judges, and memory managers), a Congress.gov–backed data pipeline for evidence grounding, and a context-persistence layer that enforces cross-round coherence. Unlike prior work that evaluates isolated turns or static summaries \citep{Kornilova2019BillSum,LegalBench2023}, DebateSim operationalizes debate as a \emph{process}: agents must cite, rebut, weigh, and update claims across five rounds, while an AI judge produces rubric-based feedback \citep{Zheng2023JudgingLLMs,Bai2022ConstitutionalAI}. On two complex topics—H.R.~40 (reparations study) and H.R.~1 (comprehensive legislation)—DebateSim achieves \textbf{100\%} structural compliance (exactly three labeled arguments in openings), \textbf{89\%} citation accuracy against source texts, and a \textbf{+23 pp} improvement in rebuttal-reference rate from early to late rounds, with stable latencies (avg \textbf{17.7s} per turn) over \textbf{25} total rounds. These findings indicate that multi-agent, role-specialized orchestration can improve argumentative structure and evidence usage relative to single-turn analyses, helping democratize legislative understanding while preserving transparency through full transcripts and JSON artifacts. All code used in this project is available at \url{https://anonymous.4open.science/r/cot-debate-drift-3EF6/README.md}.
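To make the described orchestration concrete, below is a minimal Python sketch of one plausible round loop with a context-persistence memory and a rubric-based judge. Everything here is illustrative: the `llm_complete` helper, the `DebateMemory` class, the prompt wording, and the rubric fields are assumptions made for exposition, not the authors' released implementation (see the repository link above for the actual code).

```python
# Minimal sketch of a DebateSim-style debate loop (illustrative, not the authors' code).
# `llm_complete` is a placeholder you must wire to your own model client.
import json
from dataclasses import dataclass, field

ROUNDS = 5  # the abstract describes five debate rounds per topic


def llm_complete(prompt: str) -> str:
    """Placeholder for a call to an LLM backend."""
    raise NotImplementedError("connect your own model client here")


@dataclass
class DebateMemory:
    """Context-persistence layer: keeps every turn so later rounds can rebut earlier ones."""
    turns: list = field(default_factory=list)

    def add(self, round_no: int, side: str, text: str) -> None:
        self.turns.append({"round": round_no, "side": side, "text": text})

    def transcript(self) -> str:
        return "\n\n".join(
            f"[Round {t['round']} | {t['side']}]\n{t['text']}" for t in self.turns
        )


def debater_prompt(side: str, bill_text: str, memory: DebateMemory, round_no: int) -> str:
    # Structured prompt: three labeled arguments in the opening, explicit rebuttals afterwards.
    task = ("Present exactly three labeled arguments (1., 2., 3.) citing the bill text."
            if round_no == 1 else
            "Rebut the opposing side's prior points by name, then extend your case with citations.")
    return (f"You are the {side} debater on the bill below.\n"
            f"BILL TEXT:\n{bill_text}\n\nDEBATE SO FAR:\n{memory.transcript()}\n\nTASK: {task}")


def judge_prompt(memory: DebateMemory) -> str:
    # Rubric-based feedback returned as JSON, mirroring the paper's JSON artifacts.
    return ("Score the debate transcript below on structure, citation accuracy, and rebuttal "
            "engagement (0-10 each). Respond as JSON with keys "
            '"structure", "citations", "rebuttal", "rationale".\n\n' + memory.transcript())


def run_debate(bill_text: str) -> dict:
    memory = DebateMemory()
    for round_no in range(1, ROUNDS + 1):
        for side in ("Pro", "Con"):
            memory.add(round_no, side, llm_complete(debater_prompt(side, bill_text, memory, round_no)))
    verdict = json.loads(llm_complete(judge_prompt(memory)))
    return {"transcript": memory.turns, "judge": verdict}
```

In this sketch the memory object replays the full transcript into every prompt, which is the simplest way to realize the cross-round coherence the abstract attributes to its context-persistence layer; a production system would likely summarize or window the history instead.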
Submission Number: 166