Structure-Aware Language Models Trained on Ultra-Mega-Scale Metagenomic Data Improve Protein Folding Stability Prediction
Track: Machine learning: computational method and/or computational results
Nature Biotechnology: No
Keywords: Absolute Folding Stability Prediction, Structure-Aware Language Model, MGnify
Abstract: Predicting absolute protein folding stability remains challenging due to the limited availability of experimental datasets and the intricate interplay between sequence and structure contributions to stability. In this study, we experimentally measured the folding stability of 2 million diverse, high-quality sequences from the MGnify metagenomic database using high-throughput cDNA display methods. The dataset comprises 814,000 wild-type (WT) proteins along with sequences carrying point mutations and insertions/deletions. We fine-tuned the structure-aware protein language models SaProt and ESM3 on these stability measurements using LoRA (Low-Rank Adaptation), achieving a Spearman correlation of 0.87 on the MGnify test set. Our results demonstrate that these models can predict absolute folding stability for both insertions/deletions and mutational effects, even on non-cDNA-display datasets spanning a wide stability range, including large proteins.
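The abstract reports model quality as a Spearman rank correlation between predicted and measured stabilities. A minimal, dependency-free sketch of that metric (using made-up toy values, not the paper's data) clarifies what is being computed:

```python
def rank(values):
    """1-based ranks, averaging over ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Group tied values and assign them their average rank.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman correlation = Pearson correlation of the rank vectors."""
    rx, ry = rank(x), rank(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5

# Hypothetical toy values standing in for measured vs. predicted
# stabilities; the real evaluation uses the MGnify test set.
measured = [1.2, 0.5, 2.3, 3.1, 0.9]
predicted = [1.0, 0.7, 2.0, 3.5, 1.1]
print(round(spearman(measured, predicted), 3))  # → 0.9
```

Because it depends only on rank order, this metric is robust to the monotone calibration differences one expects when comparing predicted scores against experimentally measured stabilities.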
Anonymization: This submission has been anonymized for double-blind review via the removal of identifying information such as names, affiliations, and identifying URLs.
Submission Number: 91