ProtFIM: Fill-in-Middle Protein Sequence Design via Protein Language Models

Youhan Lee; Hasun Yu

ProtFIM: Fill-in-Middle Protein Sequence Design via Protein Language Models

Youhan Lee, Hasun Yu

Published: 01 Feb 2023, Last Modified: 13 Feb 2023Submitted to ICLR 2023Readers: Everyone

Keywords: Protein language modeling, Protein engineering, Text infilling

Abstract: Following the investigation that protein sequence determines its structure and function, engineering protein sequences allows us to optimize the functions of proteins for specific purposes such as enhancement of catalytic activity or binding affinity maturation. In protein engineering, there are many cases where the amino acids in the middle of a protein sequence are changed while maintaining the remaining residues to avoid unwanted functional changes from remaining residues. However, existing research on protein sequence design via protein language models (PLMs) has focused on modifying suffix residues by prompting prefix residues to the model or mutating the overall sequence residues. This is unsuitable for scenarios where the residues located in the middle of the sequence are to be optimized. In this work, we suggest a PLM-based framework to solve the fill-in-middle (FIM) protein engineering tasks. To evaluate the performance of PLMs on the FIM tasks, we design a novel evaluation scheme where PLMs are tasked to generate new sequences while maintaining the secondary structures. Also, we propose a new PROTein language model specialized for the Fill-In-Middle task, ProtFIM. Experiments confirm that ProtFIM performs FIM engineering efficiently, especially for alpha-helix structures, and provides decent protein representations of sequence-function relationships. Finally, we demonstrate an artificial protein sequence design framework composed of ProtFIM and a high-quality structure predictor as a novel tool to optimize protein sequences.

Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.

No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.

Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics

Submission Guidelines: Yes

Please Choose The Closest Area That Your Submission Falls Into: Machine Learning for Sciences (eg biology, physics, health sciences, social sciences, climate/sustainability )

TL;DR: We propose a new evaluation scheme and protein language model for fill-in-middle protein sequence design.

15 Replies

Loading