Toward Experiment-Guided Hypothesis Ranking via Simulated Experimental Feedback

ICLR 2026 Conference Submission 16904 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Experiment-Guided Ranking, Large Language Models (LLMs)
Abstract: Hypothesis ranking is a crucial component of automated scientific discovery, particularly in natural sciences where wet-lab experiments are costly and throughput-limited. Existing approaches focus on pre-experiment ranking, relying solely on a language model’s internal reasoning without incorporating empirical outcomes. We introduce the task of experiment-guided ranking, which prioritizes hypotheses based on feedback from previously tested ones. However, developing such strategies in natural science domains is challenging due to the impractical requirement of repeatedly conducting real experiments. To address this, we revisit the core purpose of real experiments: to provide feedback on both the ground-truth hypothesis and the surrounding hypotheses that form the path toward it. This motivates our alternative: a simulator grounded in three domain-informed conceptual foundations, modeling hypothesis performance as a function of similarity to a known ground truth, perturbed by noise. While the ground truth is pre-specified, it remains hidden from the ranking agent, enabling faithful evaluation of policies that navigate toward it. Validated against 124 hypotheses with experimentally reported outcomes, the simulator approximates real experimental results with consistent trend alignment. Though not perfectly accurate, its deviations resemble wet-lab noise and can foster more robust ranking strategies. We formulate experiment-guided ranking as a sequential decision-making problem and propose an in-context reinforcement learning (ICRL) framework. Within this framework, we introduce an LLM-based agentic policy that decomposes hypotheses into functional elements, clusters them by shared mechanistic roles, and prioritizes recombinations of promising elements based on feedback. Experiments show that our method significantly outperforms pre-experiment baselines and strong ablations. Our toolkit, comprising the simulator and ICRL framework, enables systematic research on experiment-guided ranking, with our policy serving as a strong proof of concept.
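To make the simulator idea concrete, below is a minimal illustrative sketch, not the authors' implementation: it scores a hypothesis by its similarity to a hidden ground truth and perturbs the score with noise, mirroring the abstract's description. All names here (`simulate_feedback`, the Jaccard similarity stand-in, the noise scale) are hypothetical choices for illustration only.

```python
# Minimal sketch of a similarity-plus-noise feedback simulator.
# NOT the paper's simulator; a stand-in under assumed design choices.
import random


def jaccard_similarity(a: str, b: str) -> float:
    """Crude placeholder for a domain-informed similarity measure."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0


def simulate_feedback(hypothesis: str, ground_truth: str,
                      noise_scale: float = 0.1,
                      rng: random.Random | None = None) -> float:
    """Return a noisy performance score in [0, 1].

    The ground truth is known to the simulator but hidden from the
    ranking agent, which only ever observes the returned scores.
    """
    rng = rng or random.Random()
    score = jaccard_similarity(hypothesis, ground_truth)
    score += rng.gauss(0.0, noise_scale)   # wet-lab-like perturbation
    return min(1.0, max(0.0, score))       # clamp to a valid range


if __name__ == "__main__":
    # Hypothetical example hypotheses, purely for illustration.
    GROUND_TRUTH = "catalyst A improves yield via a radical intermediate"
    candidates = [
        "catalyst A improves yield via a radical intermediate",
        "catalyst B improves yield via an ionic pathway",
        "solvent choice has no effect on yield",
    ]
    rng = random.Random(0)
    # An experiment-guided ranking policy would use these noisy scores
    # as feedback to decide which hypothesis to test next.
    for h in candidates:
        print(f"{simulate_feedback(h, GROUND_TRUTH, rng=rng):.2f}  {h}")
```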
Supplementary Material: zip
Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)
Submission Number: 16904