Track: Machine learning: computational method and/or computational results
Nature Biotechnology: No
Keywords: protein engineering, generative model, machine learning, discrete diffusion, guidance, reinforcement learning, language model, optimization
Abstract: Protein fitness optimization seeks a protein sequence that satisfies desired quantitative properties within an astronomically large design space of possible sequences, where real-world fitness can typically be measured for only a few hundred sequences. Existing machine learning approaches for efficiently navigating the protein design space broadly fall into two categories, discriminative (often supervised) modeling and generative modeling, each with its own strengths and weaknesses. Supervised models can identify promising variants, but require predicting fitness values for every possible sequence in a design space. Generative models, by contrast, are not hampered by the size of the design space, but historically it has been difficult to direct these models toward specific fitness goals. To address these limitations, we propose a framework for protein sequence optimization in which generative priors on natural sequences are steered with assay-labeled fitness data, taking advantage of both unlabeled and labeled data. Specifically, we evaluate discrete diffusion and language models in combination with steering techniques such as guidance and reinforcement learning. Our computational studies on the TrpB and CreiLOV protein fitness datasets show that several of these methods, particularly guidance with discrete diffusion models, are effective strategies for protein fitness optimization.
Anonymization: This submission has been anonymized for double-blind review via the removal of identifying information such as names, affiliations, and identifying URLs.
Presenter: ~Jason_Yang3
Format: Yes, the presenting author will attend in person if this work is accepted to the workshop.
Funding: No, the presenting author of this submission does *not* fall under ICLR’s funding aims, or has sufficient alternate funding.
Submission Number: 60