Unexplored regions of the protein sequence-structure map revealed at scale by a library of “foldtuned” language models

Arjuna Subramanian; Matt Thomson

Published: 27 Oct 2023, Last Modified: 23 Nov 2023. Venue: GenBio@NeurIPS2023 Poster

Keywords: large language models, protein structure prediction, protein design

TL;DR: A protein large language model is modified to trace structure-preserving paths through novel sequence-space for several hundred fundamental protein folds, including therapeutic and catalytic targets.

Abstract: Nature has likely sampled only a fraction of all protein sequences and structures allowed by the laws of biophysics. However, the combinatorial scale of amino-acid sequence-space has traditionally precluded substantive study of the full protein sequence-structure map. In particular, it remains unknown how much of the vast uncharted landscape of far-from-natural sequences consists of alternate ways to encode the familiar ensemble of natural folds; proteins in this category also represent an opportunity to diversify candidates for downstream applications. Here, we characterize sequence-structure mapping in far-from-natural regions of sequence-space, guided by the capacity of protein language models (pLMs) to explore sequences outside their natural training data through generation. We demonstrate that pretrained generative pLMs sample a limited structural snapshot of the natural protein universe, including >300 common (sub)domain elements. Incorporating pLM, structure prediction, and structure-based search techniques, we surpass this limitation by developing a novel "foldtuning" strategy that pushes a pretrained pLM into a generative regime that maintains structural similarity to a target protein fold (e.g., TIM barrel or thioredoxin) while maximizing dissimilarity to natural amino-acid sequences. We apply "foldtuning" to build a library of pLMs for >700 naturally abundant folds in the SCOP database, accessing swaths of proteins that take familiar structures yet lie far from known sequences, spanning targets that include enzymes, immune ligands, and signaling proteins. By revealing protein sequence-structure information at scale outside the context of evolution, we anticipate that this work will enable future systematic searches for wholly novel folds and facilitate more immediate protein design goals in catalysis and medicine.
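
The selection idea behind "foldtuning" (keep generated sequences that still resemble the target fold, and favor those least similar to natural sequences) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The `Candidate` fields, the `select_for_next_round` helper, the 0.5 threshold, and the example sequences and scores are all hypothetical placeholders; in practice the scores would come from structure prediction and structure-based search, which are not modeled here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """A generated sequence with two precomputed, hypothetical scores."""
    sequence: str
    structural_similarity: float  # assumed score in [0, 1] against the target fold
    max_natural_identity: float   # assumed max fractional identity to any natural sequence


def select_for_next_round(candidates, min_structural_similarity=0.5, top_k=2):
    """Keep fold-preserving candidates, then prefer the most sequence-novel ones.

    The threshold and top_k are arbitrary illustrative values, not values
    taken from the paper.
    """
    # Step 1: discard candidates that no longer resemble the target fold.
    kept = [c for c in candidates
            if c.structural_similarity >= min_structural_similarity]
    # Step 2: rank survivors by dissimilarity to natural sequences
    # (lowest identity first).
    kept.sort(key=lambda c: c.max_natural_identity)
    # Step 3: return the top_k most novel candidates, e.g. as fine-tuning
    # examples for a subsequent round.
    return kept[:top_k]


# Made-up example data for illustration only.
candidates = [
    Candidate("MKTAYIAK", structural_similarity=0.82, max_natural_identity=0.71),
    Candidate("GSHMLEDP", structural_similarity=0.64, max_natural_identity=0.22),
    Candidate("AAAAGGGG", structural_similarity=0.31, max_natural_identity=0.05),
    Candidate("MQRWLNPE", structural_similarity=0.77, max_natural_identity=0.35),
]

selected = select_for_next_round(candidates)
for c in selected:
    print(c.sequence, c.structural_similarity, c.max_natural_identity)
```

In this toy run, the candidate with the lowest sequence identity is dropped because it fails the structural filter, which reflects the stated trade-off: novelty is rewarded only among sequences that still match the target fold.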

Submission Number: 81
