LLM Novice Uplift on Dual-Use, In Silico Biology Tasks: A Multi-Benchmark Assessment

Published: 01 Mar 2026, Last Modified: 03 Mar 2026 · ICLR 2026 AIWILD · CC BY 4.0
Keywords: AI Safety, Large Language Models, Biosecurity, Uplift
TL;DR: We find that novices with access to frontier LLMs are 4.16× more accurate on dual-use biosecurity tasks than those with only internet access, providing the first large-scale empirical measurement of LLM-enabled uplift in this domain.
Abstract: Large language models (LLMs) perform increasingly well on biology benchmarks, but it remains unclear whether they \textit{uplift} novice users---i.e., enable humans to perform better than with internet-only resources. This uncertainty is central to understanding both scientific acceleration and dual-use risk. We conducted a multi-model, multi-benchmark human uplift study comparing novices with LLM access versus internet-only access across eight biosecurity-relevant task sets. Participants worked on complex problems with ample time (up to 13 hours for the most involved tasks). We found that LLM access provided substantial uplift: novices with LLMs were $4.16\times$ more accurate than controls (95% CI $[2.63, 6.87]$). On four benchmarks with available expert baselines (internet-only), novices with LLMs outperformed experts on three of them. Perhaps surprisingly, standalone LLMs often exceeded LLM-assisted novices, indicating that users were not eliciting the strongest available contributions from the LLMs. Most participants (89.6%) reported little difficulty obtaining dual-use-relevant information despite safeguards. Overall, LLMs substantially uplift novices on biological tasks previously reserved for trained practitioners, underscoring the need for sustained, interactive uplift evaluations alongside traditional benchmarks.
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 80