Conditioning Protein Language Models Using High-Throughput Sequence-Fitness Data Collection
Keywords: protein language model, reinforcement learning, DPO, enzyme engineering, high-throughput sequence-fitness data
TL;DR: We applied direct preference optimization to a protein language model using a high-throughput sequence-fitness dataset of O-methyltransferases.
Abstract: Current generative models of protein sequences, such as protein language models (pLMs), can generate novel functional sequences, but most strategies do not integrate labeled fitness data from real-world experiments. In this study, we explore fitness-conditioned generation from an autoregressive pLM that captures evolutionary information from a protein family, using direct preference optimization (DPO) with large amounts of real-world experimental data. Our method leverages MillionFull, a high-throughput method used to collect over 100,000 unique sequence-fitness pairs for O-methyltransferases (OMTs) that catalyze the non-native formation of isovanillic acid. Specifically, we pretrain ProGen2 on natural OMTs, then use the MillionFull-collected labeled dataset to align the pLM toward generating higher-fitness sequences. The DPO-conditioned model generates sequences with significantly higher predicted fitness than the pretrained model while maintaining high sequence diversity and mutational profiles consistent with top-performing experimental variants. Notably, wet-lab validation confirms that the best-performing DPO variant achieves a 16-fold fitness increase over the parent sequence and a 3-fold increase over the top variant in the training data. Overall, we demonstrate a robust "lab-in-the-loop" framework capable of generating diverse, high-fitness enzyme variants for non-native functional targets.
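The abstract does not spell out the alignment objective, but the core of DPO on sequence pairs is compact enough to sketch. Below is a minimal, hedged Python/PyTorch illustration of the standard DPO loss applied to protein sequences, where each preference pair contrasts a higher-fitness ("chosen") variant against a lower-fitness ("rejected") one from the MillionFull measurements; the function name, the pairing scheme, and the beta value are illustrative assumptions, not details from the paper.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_logp_w: torch.Tensor,
             policy_logp_l: torch.Tensor,
             ref_logp_w: torch.Tensor,
             ref_logp_l: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Standard DPO loss for a batch of sequence preference pairs.

    Each argument is the summed token log-likelihood of a full protein
    sequence under the trainable policy or the frozen reference model
    (here, the OMT-pretrained pLM). The "w" tensors correspond to the
    higher-fitness sequence in each pair, "l" to the lower-fitness one.
    beta controls how far the policy may drift from the reference.
    """
    # Implicit reward: beta-scaled log-ratio of policy vs. reference
    # likelihood, compared between the preferred and dispreferred sequence.
    logits = beta * ((policy_logp_w - ref_logp_w)
                     - (policy_logp_l - ref_logp_l))
    # Maximizing log-sigmoid of the margin pushes the policy to assign
    # relatively more probability mass to higher-fitness variants.
    return -F.logsigmoid(logits).mean()
```

In practice, the per-sequence log-likelihoods would come from summing token log-probabilities of the autoregressive pLM over each variant, and preference pairs would be constructed by ranking variants on their measured fitness.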
Submission Number: 41