MutaGen: Implicitly Guided Protein Evolution from Ranked Feedback via Pair-Based Discrete Flow Matching
Keywords: Protein engineering, directed evolution, discrete flow matching, machine learning–guided protein evolution
TL;DR: MutaGen enables data-efficient protein optimization from ranked variants alone via discrete flow matching, achieving strong benchmark results and a more than 80-fold experimental gain with as few as 20 variants per round.
Abstract: Machine learning-directed evolution (MLDE) aims to democratize protein engineering by making it possible to optimize any protein with any assay at accessible cost, drastically reducing the need to screen thousands of protein sequences. In this work, we introduce MutaGen, a novel discrete flow-matching (DFM) method trained to iteratively mutate protein sequences towards high-fitness regions of the protein fitness landscape without relying on noisy in silico fitness predictions. Training minimizes a token-level cross-entropy flow-matching loss to learn a vector field of improvement from ranked sequence pairs alone. Across realistic screening budgets, MutaGen enables multi-mutational protein optimization with minimal data (as few as 20 sequences per round of evolution) while bypassing the need for an explicit fitness predictor. We validate our approach on standard in silico benchmarks (GFP and AAV) and experimentally in a four-round campaign on NanoLuc, achieving a more than 80-fold increase in luminescence over the wild type.
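The pair-based training signal described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the probability path, the model, or how pairs are built, so the per-position mixture path, the all-pairs construction, the equal-length (substitution-only) sequences, and every function name below (`encode`, `make_ranked_pairs`, `interpolate`, `token_cross_entropy`) are assumptions made for illustration. A real model would replace the uniform logits used here.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_TO_ID = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def encode(seq):
    """Map an amino-acid string to an array of integer token ids."""
    return np.array([AA_TO_ID[aa] for aa in seq], dtype=np.int64)


def make_ranked_pairs(scores):
    """Return (worse, better) index pairs for every strictly ranked pair.

    Assumption: all pairs are used; the paper may select pairs differently.
    """
    order = np.argsort(scores, kind="stable")
    pairs = []
    for a in range(len(order)):
        for b in range(a + 1, len(order)):
            i, j = int(order[a]), int(order[b])
            if scores[i] < scores[j]:
                pairs.append((i, j))
    return pairs


def interpolate(x0, x1, t, rng):
    """Assumed mixture path: each position independently takes the
    better sequence's token with probability t, else the worse one's."""
    take_better = rng.random(x0.shape) < t
    return np.where(take_better, x1, x0)


def token_cross_entropy(logits, targets):
    """Mean token-level cross-entropy of logits (L, V) against targets (L,)."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()


# Toy round: three equal-length variants with measured (hypothetical) scores.
rng = np.random.default_rng(0)
variants = ["ACDE", "ACDF", "AGDF"]
scores = [0.1, 0.5, 0.3]

losses = []
for worse, better in make_ranked_pairs(scores):
    x0, x1 = encode(variants[worse]), encode(variants[better])
    t = rng.random()                       # time sampled uniformly in [0, 1)
    x_t = interpolate(x0, x1, t, rng)      # noisy point between the pair
    # Placeholder for a model taking (x_t, t): uniform logits over tokens.
    logits = np.zeros((len(x_t), len(AMINO_ACIDS)))
    losses.append(token_cross_entropy(logits, x1))

mean_loss = float(np.mean(losses))
```

In this sketch the lower-ranked sequence plays the role of the source and the higher-ranked one the target, so minimizing the loss pushes predictions toward the better variant without ever regressing a fitness value.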
Presenter: ~Paolo_L.B._Fischer1
Format: Yes, the presenting author will attend in person if this work is accepted to the workshop.
Funding: Yes, the presenting author of this submission falls under ICLR’s funding aims, and funding would significantly impact their ability to attend the workshop in person.
Submission Number: 77