ESMGain: Effective and Efficient Prediction of Mutations' Functional Effects via ESM2 Transfer Learning and Robust Benchmarks
Keywords: protein, language model, deep learning, biology, gain of function, enzyme
TL;DR: ESMGain fine-tunes ESM2 to predict the functional effects of mutations, outperforming competitors through task-specific optimization, and introduces a benchmarking framework with the relative Bin-Mean Error (rBME) as a metric that remains accurate across effect types.
Abstract: Predicting the functional effect of mutations, especially on properties such as catalytic activity, holds greater significance for clinicians and protein engineers than traditional pathogenicity prediction. Recent approaches leveraging static ESM1 embeddings or multimodal features (e.g., embeddings, structures, and evolutionary data) either (1) fall short in accuracy or (2) involve complex preprocessing pipelines. Moreover, functional effect prediction suffers from (3) a lack of standardized datasets and metrics for robust benchmarking. We address these challenges by systematically optimizing ESM2-based functional effect prediction: through extensive ablation studies, we demonstrate that fine-tuning significantly outperforms static embeddings, that scaling laws for model size do not transfer to this task, and that LoRA matches full fine-tuning performance, the latter two deviating from trends observed in natural language processing. Our framework, ESMGain, fine-tunes layers of the 35M-parameter ESM2 model with an inductive-bias regression head and achieves state-of-the-art performance, slightly surpassing the multimodal competitor PreMode and indicating redundancy in its structural and evolutionary features. We further propose a benchmarking framework featuring robust test datasets and strategies, together with the relative Bin-Mean Error (rBME), a metric designed to emphasize prediction accuracy in challenging, non-clustered, and rare gain-of-function regions. rBME reflects model performance better than the commonly used Spearman's rho, as evidenced by plot-based analyses. As ESMGain exhibits mixed transferability to unseen mutational regions, we identify multiple areas for improvement, such as finer-grained pretraining strategies.
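As a rough illustration of the setup the abstract describes (LoRA fine-tuning of the 35M-parameter ESM2 with a regression head), the following is a minimal sketch using the Hugging Face facebook/esm2_t12_35M_UR50D checkpoint and the peft library. The head here concatenates the mutated-site embedding with a mean-pooled sequence embedding, which is one plausible "inductive bias" for point-mutation effect prediction; the paper's exact head design, LoRA rank, and target modules are not given in the abstract and are assumptions.

import torch
import torch.nn as nn
from transformers import AutoTokenizer, EsmModel
from peft import LoraConfig, get_peft_model

MODEL = "facebook/esm2_t12_35M_UR50D"  # public 35M-parameter ESM2 checkpoint

class MutationEffectRegressor(nn.Module):
    """ESM2 backbone with LoRA adapters and a small regression head (sketch)."""

    def __init__(self):
        super().__init__()
        backbone = EsmModel.from_pretrained(MODEL)
        # Assumed LoRA hyperparameters; the paper's values may differ.
        lora = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.1,
                          target_modules=["query", "value"])
        self.backbone = get_peft_model(backbone, lora)
        hidden = backbone.config.hidden_size
        self.head = nn.Sequential(nn.Linear(2 * hidden, hidden),
                                  nn.GELU(), nn.Linear(hidden, 1))

    def forward(self, input_ids, attention_mask, mut_pos):
        h = self.backbone(input_ids=input_ids,
                          attention_mask=attention_mask).last_hidden_state
        # Mean-pool over real tokens plus the embedding at the mutated site.
        pooled = (h * attention_mask.unsqueeze(-1)).sum(1) / \
                 attention_mask.sum(1, keepdim=True)
        site = h[torch.arange(h.size(0)), mut_pos]
        return self.head(torch.cat([site, pooled], dim=-1)).squeeze(-1)

tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = MutationEffectRegressor()
batch = tokenizer(["MKTAYIAKQR", "MKTAYIAKQL"], return_tensors="pt", padding=True)
# Mutation at 0-based residue index 8; +1 offset for the leading <cls> token.
pred = model(batch["input_ids"], batch["attention_mask"],
             mut_pos=torch.tensor([9, 9]))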
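The abstract names but does not define rBME, so the following is a hypothetical sketch of one reading of a "relative Bin-Mean Error": mean absolute error computed per bin of the ground-truth effect score, normalized by the overall score range, and averaged with equal weight per bin so that sparse regions (e.g., rare gain-of-function mutations) count as much as densely populated ones. The bin count and normalization are assumptions; the paper's exact definition may differ.

import numpy as np

def rbme(y_true, y_pred, n_bins=10):
    """Hypothetical relative Bin-Mean Error: equal-weighted per-bin MAE,
    normalized by the ground-truth score range."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    lo, hi = y_true.min(), y_true.max()
    edges = np.linspace(lo, hi, n_bins + 1)
    bin_ids = np.clip(np.digitize(y_true, edges) - 1, 0, n_bins - 1)
    errors = []
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():  # skip empty bins rather than averaging zeros
            errors.append(np.abs(y_true[mask] - y_pred[mask]).mean() / (hi - lo))
    return float(np.mean(errors))

# Example: the sparse high-effect point contributes a whole bin's weight,
# so under-predicting it is penalized more than under Spearman's rho.
score = rbme(y_true=[0.1, 0.2, 0.25, 1.8], y_pred=[0.15, 0.3, 0.2, 1.1])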
Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 14247