Automated Hypothesis Validation with Agentic Sequential Falsifications

Published: 01 May 2025 · Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY 4.0
TL;DR: AI agent automates hypothesis validation by iteratively designing and executing falsification experiments under a rigorous sequential error control framework.
Abstract: Hypotheses are central to information acquisition, decision-making, and discovery. However, many real-world hypotheses are abstract, high-level statements that are difficult to validate directly. This challenge is further intensified by the rise of hypothesis generation from Large Language Models (LLMs), which are prone to hallucination and produce hypotheses in volumes that make manual validation impractical. Here we propose POPPER, an agentic framework for rigorous automated validation of free-form hypotheses. Guided by Karl Popper's principle of falsification, POPPER validates a hypothesis using LLM agents that design and execute falsification experiments targeting its measurable implications. A novel sequential testing framework ensures strict Type-I error control while actively gathering evidence from diverse observations, whether drawn from existing data or newly conducted procedures. We demonstrate POPPER across six domains, including biology, economics, and sociology. POPPER delivers robust error control, high power, and scalability. Furthermore, compared to human scientists, POPPER achieved comparable performance in validating complex biological hypotheses while reducing validation time 10-fold, providing a scalable, rigorous solution for hypothesis validation. POPPER is freely available at https://github.com/snap-stanford/POPPER.
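The abstract's claim of strict Type-I error control under sequential, data-dependent experimentation is the kind of guarantee typically obtained with e-values rather than raw p-values. As an illustration only, and not necessarily the paper's exact procedure, the sketch below shows one standard construction: each falsification experiment's p-value is calibrated into an e-value, the e-values are multiplied across experiments, and the null is rejected once the running product reaches 1/alpha, which bounds the Type-I error at alpha at any stopping time by Ville's inequality. The function names, the calibrator choice (kappa = 0.5), and the example numbers are assumptions of this sketch.

```python
def p_to_e(p: float, kappa: float = 0.5) -> float:
    """Calibrate a p-value into an e-value via the calibrator
    e = kappa * p**(kappa - 1), valid for any kappa in (0, 1).
    Small p-values map to large e-values (evidence against the null)."""
    return kappa * p ** (kappa - 1)

def sequential_falsification(p_values, alpha: float = 0.1):
    """Multiply e-values from successive falsification experiments.
    Rejecting the null as soon as the running product reaches 1/alpha
    keeps the Type-I error at most alpha at any stopping time
    (Ville's inequality). Returns (stopping index or None, evidence)."""
    e_product = 1.0
    for t, p in enumerate(p_values, start=1):
        e_product *= p_to_e(p)
        if e_product >= 1.0 / alpha:
            return t, e_product  # enough evidence accumulated; stop early
    return None, e_product  # evidence insufficient so far

# Hypothetical p-values from three successive falsification experiments:
stop, evidence = sequential_falsification([0.04, 0.01, 0.30], alpha=0.1)
print(stop, evidence)  # stops at t=2 with e-product 12.5 >= 1/0.1
```

A design note on why this construction suits an agentic loop: because e-values compose by multiplication, each subsequent falsification experiment can be chosen adaptively, based on everything observed so far, without invalidating the error guarantee.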
Lay Summary: Hypotheses — educated guesses about how the world works — are the foundation of scientific discovery. But many real-world hypotheses are abstract and hard to test directly. With AI systems like ChatGPT now generating large numbers of scientific hypotheses, the problem is growing: How do we know which ones are actually true? We introduce a new tool called Popper, named after the philosopher Karl Popper, who believed that the best way to test an idea is to try to prove it wrong. Popper is an AI system that automatically designs and runs experiments to challenge a given hypothesis. If the hypothesis can survive these rigorous tests, it becomes more trustworthy. Our system uses powerful language models as agents to create and carry out these tests — across fields as diverse as biology, economics, and sociology. Unlike traditional tools, it ensures results are statistically sound and avoids false positives. Popper can validate hypotheses as well as human scientists but does so 10 times faster. By making hypothesis testing faster and more reliable, it helps scale up scientific discovery in the age of AI.
Link To Code: https://github.com/snap-stanford/POPPER
Primary Area: Deep Learning -> Large Language Models
Keywords: LLM agent, hypothesis testing, sequential decision making, safe testing, data-driven discovery, sequential error control
Submission Number: 8182