MED-MAD: Breaking Medical Mental Set with Mindset-Diversified Multi-Agent Debate

ACL ARR 2026 January Submission4600 Authors

05 Jan 2026 (modified: 20 Mar 2026) · CC BY 4.0
Keywords: medical question answering, clinical reasoning, multi-agent systems, debate-based reasoning, large language models, evaluation methodology
Abstract: Large language models exhibit strong performance in medical question answering yet remain vulnerable to persistence bias, wherein early diagnostic hypotheses anchor subsequent reasoning and induce correlated, overconfident errors. We conceptualize this failure mode as the medical mental set and quantify its impact by measuring trajectory collapse, using answer agreement and rationale similarity across repeated inferences. Prior work inadequately addresses this bias: it lacks quantitative diagnostics, fails to promote hypothesis diversity, and relies on structurally homogeneous multi-agent frameworks that are insufficient to mitigate correlated diagnostic errors. To address these challenges, we introduce MED-MAD, an anonymous multi-agent framework designed to enhance hypothesis-level diversity while ensuring auditability. The framework incorporates mindset-specialized clinical roles, structured anonymous debate emphasizing safety-oriented critique and concession, optional backbone heterogeneity, and confidence-aware aggregation reinforced by counterfactual consistency checks. These components work together to promote rigorous and diverse diagnostic reasoning while minimizing correlated errors. Experimental evaluations demonstrate that MED-MAD consistently improves diagnostic accuracy and reduces correlated errors across benchmark datasets under matched inference budgets. These findings highlight its potential to support safer and more reliable clinical reasoning, advancing the role of LLMs in medical decision support.
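The abstract's trajectory-collapse diagnostic (answer agreement and rationale similarity across repeated inferences) can be illustrated with a minimal sketch. The paper's exact metrics are not specified here, so the choices below — modal-answer agreement rate and mean pairwise Jaccard overlap of rationale tokens — are illustrative assumptions, not the authors' definitions:

```python
from collections import Counter
from itertools import combinations

def answer_agreement(answers):
    """Fraction of repeated inferences matching the modal answer (1.0 = full collapse)."""
    _, count = Counter(answers).most_common(1)[0]
    return count / len(answers)

def rationale_similarity(rationales):
    """Mean pairwise Jaccard overlap of rationale token sets (illustrative proxy)."""
    sets = [set(r.lower().split()) for r in rationales]
    pairs = list(combinations(sets, 2))
    return sum(len(a & b) / len(a | b) for a, b in pairs) / len(pairs)

# High values on both metrics over repeated runs of the same question would
# indicate the correlated, anchored reasoning the paper calls a mental set.
```

Under this sketch, `answer_agreement(["A", "A", "B", "A"])` yields 0.75, and identical rationales yield a similarity of 1.0.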
Paper Type: Long
Research Area: Clinical and Biomedical Applications
Research Area Keywords: medical question answering, multi-agent systems, clinical reasoning, large language models
Contribution Types: NLP engineering experiment
Languages Studied: English
Submission Number: 4600