Reasoning Models Outperform Standard Language Models in De Novo Protein Design

Alfred Greisen; Longfei Cong; Per Jr. Greisen; Sergey Ovchinnikov

Published: 08 Oct 2025, Last Modified: 30 Oct 2025. Venue: Agents4Science. License: CC BY 4.0

Keywords: ChatGPT, AlphaFold, Protein Design, CD

TL;DR: Large language models can design stable, correctly folded proteins from a single natural-language prompt, as confirmed by AlphaFold predictions and experimental CD validation.

Abstract: We compared reasoning-enhanced and standard large language models on de novo protein design, using four-helix bundles as a benchmark. Testing five ChatGPT variants with identical prompts and defining success as pLDDT > 75, we found a sharp capability divide: the reasoning models o3 and o4-mini achieved success rates of 44% and 20%, respectively, while all standard language models achieved 0%. Two of the top-scoring sequences were tested experimentally; one was validated by circular dichroism (CD) spectroscopy, which confirmed α-helical structure. These results suggest that reasoning capabilities, not just model scale, are critical for complex scientific tasks such as protein design.
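
The success criterion in the abstract can be illustrated with a minimal sketch. It assumes each designed sequence is summarized by a single pLDDT value and counts a design as successful when that value is strictly greater than 75. The function name `success_rate` and the example scores are hypothetical and are not taken from the paper's data or code.

```python
PLDDT_THRESHOLD = 75.0  # success threshold stated in the abstract


def success_rate(plddt_scores, threshold=PLDDT_THRESHOLD):
    """Return the fraction of designs whose pLDDT strictly exceeds the threshold.

    Assumption: one summary pLDDT value per designed sequence.
    """
    if not plddt_scores:
        raise ValueError("need at least one pLDDT score")
    passed = sum(1 for score in plddt_scores if score > threshold)
    return passed / len(plddt_scores)


# Hypothetical scores for five designs from one model (illustrative only).
example_scores = [82.1, 76.4, 75.0, 61.3, 48.9]
rate = success_rate(example_scores)
print(f"success rate: {rate:.0%}")
```

Because the comparison is strict, a design scoring exactly 75.0 does not count as a success in this sketch.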

Supplementary Material: zip

Submission Number: 256
