WetBench: LLM-Based Simulation Environment to Evaluate Wet-Lab Experiment Planning and Design

20 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: ai for science, wet-lab experimental design
Abstract: We introduce WetBench, an LLM-based simulation environment for scalably evaluating AI systems' ability to design and plan wet-lab experiments. Traditional evaluation approaches are limited by the expense and safety concerns of executing AI-generated experiments in physical laboratories. To address this, we developed a simulation environment that uses LLMs both as state-transition models that simulate experimental outcomes and as evidence classifiers that judge whether the experiments performed provide sufficient information to achieve the stated goal. WetBench includes 18 expert-curated experimental configurations spanning cell biology, neuroscience, microbiology, and analytical chemistry, each validated as solvable within the environment's constraints. Expert ratings indicate high simulation fidelity: human raters judged state transitions plausible in over 90\% of cases, and agreement between LLM evidence classifiers and human experts (72-82\%) was on par with inter-expert agreement (75\%). Using this environment, we benchmarked frontier language models on experimental design and planning. GPT-5 performed best, with a 44.4\% pass@1 rate rising to 72.2\% at pass@5, substantially outperforming other models, including Gemini 2.5 Flash (50.0\% pass@5), Qwen 3 (41.2\% pass@5), and Claude Sonnet 4 (27.8\% pass@5). We open-source WetBench as a Python Gymnasium environment to support further development of AI systems for autonomous scientific experimentation.
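The abstract describes WetBench as a Gymnasium environment in which an LLM simulates outcomes and an evidence classifier decides when the goal is met. As a rough illustration of that agent loop only — every class, method body, and string below is hypothetical and not WetBench's actual API — a mock environment following the standard Gymnasium `reset()`/`step()` contract might look like:

```python
# Hypothetical sketch of a Gymnasium-style wet-lab planning loop.
# None of these names come from WetBench; in the real environment,
# step() would query an LLM state-transition model and an LLM
# evidence classifier rather than the toy logic used here.

class MockWetLabEnv:
    """Stand-in for an LLM-simulated lab with a fixed experiment budget."""

    def __init__(self, budget=3):
        self.budget = budget
        self.steps = 0

    def reset(self, seed=None):
        self.steps = 0
        observation = "Goal: determine whether compound X inhibits enzyme Y."
        return observation, {}  # (observation, info), per the Gymnasium API

    def step(self, action):
        self.steps += 1
        observation = f"Simulated outcome of: {action}"
        # terminated: the (mock) evidence classifier judges the goal achieved
        terminated = "dose-response" in action
        # truncated: the experiment budget is exhausted
        truncated = self.steps >= self.budget
        reward = 1.0 if terminated else 0.0
        return observation, reward, terminated, truncated, {}


env = MockWetLabEnv()
obs, info = env.reset()
done = False
while not done:
    # An LLM agent would propose this action from the observation.
    action = "run dose-response assay for compound X"
    obs, reward, terminated, truncated, info = env.step(action)
    done = terminated or truncated
```

The five-tuple returned by `step()` (observation, reward, terminated, truncated, info) mirrors the Gymnasium interface the paper says it exposes; the goal-detection and budget logic here are placeholders for the LLM components.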
Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)
Submission Number: 23138