Reasoning Models Outperform Standard Language Models in De Novo Protein Design

Published: 08 Oct 2025, Last Modified: 20 Oct 2025
Venue: Agents4Science
Readers: Everyone
License: CC BY 4.0
Keywords: ChatGPT, AlphaFold, Protein Design, CD
TL;DR: Large language models can design stable, correctly folded proteins from a single natural-language prompt, as confirmed by AlphaFold predictions and experimental CD validation.
Abstract: We compared reasoning-enhanced and standard large language models for de novo protein design, using four-helix bundles as a benchmark. We tested five ChatGPT variants with identical prompts and counted a design as successful when its predicted structure had pLDDT > 75. Under this threshold we observed a sharp capability divide: the reasoning models o3 and o4-mini achieved success rates of 44% and 20%, respectively, while every standard language model achieved 0%. Two of the top-scoring sequences were tested experimentally. One was validated by circular dichroism (CD) spectroscopy, which confirmed α-helical structure. These results suggest that reasoning capabilities, not just model scale, are critical for complex scientific tasks such as protein design.
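The success-rate metric in the abstract can be illustrated with a minimal sketch. It assumes each design is summarized by a single pLDDT value (for example a per-sequence mean, which the abstract does not specify) and that success means strictly exceeding 75. The function name `success_rate`, the model labels, and the scores below are hypothetical placeholders, not data from the paper.

```python
from typing import Dict, List

# Threshold stated in the abstract; the strict inequality mirrors "pLDDT > 75".
PLDDT_THRESHOLD = 75.0


def success_rate(plddt_scores: List[float], threshold: float = PLDDT_THRESHOLD) -> float:
    """Return the fraction of designs whose pLDDT is strictly above the threshold."""
    if not plddt_scores:
        return 0.0  # no designs means no successes
    passed = sum(1 for score in plddt_scores if score > threshold)
    return passed / len(plddt_scores)


# Hypothetical per-design pLDDT values, NOT results from the paper.
example_scores: Dict[str, List[float]] = {
    "model_a": [82.1, 74.9, 90.3, 61.0],
    "model_b": [55.2, 70.0, 75.0, 48.7],
}

for name, scores in example_scores.items():
    print(f"{name}: {success_rate(scores):.0%}")
```

Note that a score of exactly 75.0 does not count as a success under the strict inequality, so the choice between `>` and `>=` can matter for borderline designs.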
Supplementary Material: zip
Submission Number: 256