Propose, Critique, Falsify: Benchmarking Self-Verifying AI Scientists

Published: 30 May 2026, Last Modified: 30 May 2026
Venue: ICML2026-AI4Science Spotlight
License: CC BY 4.0
Track: Track 1: Original Research/Position/Education/Attention Track
TL;DR: We benchmark multi-agent propose-critique-falsify pipelines for scientific claim verification and find they induce model-dependent conservatism-as-abstention rather than reliably reducing false discoveries.
Abstract: AI systems that autonomously generate scientific hypotheses are proliferating, yet the verification of their claims remains largely unexamined. We introduce a Propose-Critique-Falsify (PCF) benchmark that evaluates whether multi-agent pipelines can reduce the false discovery rate of AI-generated scientific claims. Across three evaluation domains (biomedical claim verification, synthetic statistical experiments, and scientific novelty assessment) and five frontier-class language models, we find that PCF pipelines dramatically increase false negatives through a mechanism we term conservatism-as-abstention: models emit Uncertain verdicts rather than risk incorrect classifications, rendering the false discovery rate undefined in many conditions. Critically, the direction of this error is model-dependent. Claude Sonnet 4 and GPT-4o become over-conservative under PCF, while GPT-5.4 becomes over-permissive (FDR = 0.75), labeling nearly all claims as valid. We further confirm that iterative self-refinement consistently increases false discovery rates across all models tested. Our results reveal that multi-agent verification architectures introduce systematic, model-specific biases that cannot be resolved through pipeline design alone.
Keywords: AI for Science, Multi-Agent Systems, Scientific Verification, False Discovery Rate, Benchmarking, LLM Self-Correction, Falsification, Autonomous Discovery
Submission Number: 152
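
The abstract notes that abstention can leave the false discovery rate undefined. A minimal sketch of why, assuming the standard definition FDR = FP / (FP + TP) computed over claims the pipeline labels as valid: if every verdict is Uncertain, the denominator is zero. The verdict strings "Valid" and "Invalid", the function names, and the toy data below are illustrative assumptions, not the benchmark's actual interface or results.

```python
from typing import Optional, Sequence


def false_discovery_rate(verdicts: Sequence[str], truths: Sequence[bool]) -> Optional[float]:
    """FDR = FP / (FP + TP) over claims labeled "Valid".

    Returns None when no claim is labeled "Valid", since the ratio is undefined.
    The label strings are hypothetical; only "Uncertain" is named in the abstract.
    """
    if len(verdicts) != len(truths):
        raise ValueError("verdicts and truths must have the same length")
    tp = sum(1 for v, t in zip(verdicts, truths) if v == "Valid" and t)
    fp = sum(1 for v, t in zip(verdicts, truths) if v == "Valid" and not t)
    if tp + fp == 0:
        return None  # no discoveries were asserted, so FDR has no value
    return fp / (tp + fp)


def false_negative_rate(verdicts: Sequence[str], truths: Sequence[bool]) -> Optional[float]:
    """Share of truly valid claims that were not labeled "Valid".

    Assumption: an "Uncertain" verdict on a true claim counts as a miss.
    """
    if len(verdicts) != len(truths):
        raise ValueError("verdicts and truths must have the same length")
    positives = sum(1 for t in truths if t)
    if positives == 0:
        return None
    missed = sum(1 for v, t in zip(verdicts, truths) if t and v != "Valid")
    return missed / positives


# Toy ground truth: two true claims, two false claims (illustrative only).
truths = [True, True, False, False]

# An over-conservative pipeline abstains on everything.
abstaining = ["Uncertain"] * 4
# An over-permissive pipeline accepts everything.
permissive = ["Valid"] * 4

print(false_discovery_rate(abstaining, truths), false_negative_rate(abstaining, truths))
print(false_discovery_rate(permissive, truths), false_negative_rate(permissive, truths))
```

Returning None rather than 0.0 for the all-abstain case is a deliberate choice in this sketch: reporting zero would make a pipeline that asserts nothing look as if it made no false discoveries, which hides the rise in false negatives.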