Venus-MAXWELL: Efficient Learning of Protein-Mutation Stability Landscapes using Protein Language Models
Keywords: Deep learning, Protein language model, Protein stability, Protein engineering
Abstract: In-silico prediction of protein mutant stability, measured by the difference in Gibbs free energy change ($\Delta \Delta G$), is fundamental for protein engineering.
Current sequence-to-label methods typically employ two-stage pipelines: (i) encoding mutant sequences using neural networks (e.g., transformers), followed by (ii) regressing $\Delta \Delta G$ from the latent representations.
Although these methods have demonstrated promising performance, their dependence on specialized neural network encoders significantly increases model complexity.
Additionally, the requirement to compute latent representations individually for each mutant sequence negatively impacts computational efficiency and poses the risk of overfitting.
This work proposes the Venus-MAXWELL framework, which reformulates mutation $\Delta \Delta G$ prediction as a sequence-to-landscape task.
In Venus-MAXWELL, mutations of a protein and their corresponding $\Delta \Delta G$ values are organized into a landscape matrix, allowing our framework to learn the $\Delta \Delta G$ landscape of a protein with a single forward and backward pass during training. To ensure robust evaluation, we also curated a new $\Delta \Delta G$ benchmark dataset with strict controls on data leakage and redundancy.
Leveraging the zero-shot scoring capability of protein language models (PLMs), Venus-MAXWELL effectively utilizes the evolutionary patterns learned by PLMs during pre-training.
More importantly, Venus-MAXWELL is compatible with multiple protein language models.
For example, when integrated with ESM-IF, Venus-MAXWELL achieves higher accuracy than ThermoMPNN and runs 10$\times$ faster at inference, despite having 50$\times$ more parameters than ThermoMPNN.
The training code, model weights, and datasets are publicly available at https://github.com/ai4protein/Venus-MAXWELL.
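To make the sequence-to-landscape idea concrete, the following is a minimal sketch, not the released implementation. It assumes the landscape matrix has shape (sequence length, 20), with one column per standard amino acid, and that unmeasured substitutions are excluded from the loss with a boolean mask. The zero-shot score shown is the common log-odds heuristic, log p(mutant) minus log p(wild type); whether Venus-MAXWELL uses exactly this form is an assumption. All function names (`build_landscape`, `zero_shot_scores`, `masked_mse`) are hypothetical, and the per-position logits stand in for the output of any PLM.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def build_landscape(sequence, mutations):
    """Arrange single-point mutations into an (L, 20) matrix plus a mask.

    `mutations` maps names such as "A2G" (wild type, 1-based position,
    mutant) to measured ddG values. Unmeasured cells stay masked out.
    """
    length = len(sequence)
    landscape = np.zeros((length, 20), dtype=np.float32)
    mask = np.zeros((length, 20), dtype=bool)
    for name, ddg in mutations.items():
        wt, pos, mut = name[0], int(name[1:-1]) - 1, name[-1]
        if sequence[pos] != wt:
            raise ValueError(f"{name}: wild type does not match sequence")
        landscape[pos, AA_INDEX[mut]] = ddg
        mask[pos, AA_INDEX[mut]] = True
    return landscape, mask


def zero_shot_scores(logits, sequence):
    """Log-odds of every substitution relative to the wild-type residue.

    `logits` is an (L, 20) array of per-position amino-acid logits,
    assumed to come from a single forward pass of some PLM.
    """
    log_probs = logits - np.logaddexp.reduce(logits, axis=1, keepdims=True)
    wt_idx = np.array([AA_INDEX[aa] for aa in sequence])
    wt_log_probs = log_probs[np.arange(len(sequence)), wt_idx]
    return log_probs - wt_log_probs[:, None]


def masked_mse(pred, target, mask):
    """Mean squared error over measured cells only."""
    return float(((pred - target)[mask] ** 2).mean())


if __name__ == "__main__":
    seq = "MAK"
    landscape, mask = build_landscape(seq, {"A2G": -0.5, "K3A": 1.0})
    rng = np.random.default_rng(0)
    scores = zero_shot_scores(rng.normal(size=(len(seq), 20)), seq)
    # One loss value covers every measured mutation of the protein at once.
    print(masked_mse(scores, landscape, mask))
```

Because every substitution of a protein shares one matrix, a single forward pass yields predictions for all of them, which is the source of the efficiency argument above; a per-mutant encoder would instead need one pass per mutant sequence.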
Primary Area: Machine learning for sciences (e.g. climate, health, life sciences, physics, social sciences)
Submission Number: 3600