Stabilizing protein fitness predictors via the PCS framework

Published: 12 Jun 2025, Last Modified: 06 Jul 2025 · EXAIT@ICML 2025 Poster · CC BY 4.0
Track: AI for Science
Keywords: protein fitness prediction, Bayesian optimization, uncertainty quantification
TL;DR: We show that leveraging multiple reasonable representations of the same protein sequence significantly improves both predictive performance and uncertainty quantification.
Abstract: We improve protein fitness prediction by addressing an often-overlooked source of instability in machine learning models: the choice of data representation. Guided by the Predictability–Computability–Stability (PCS) framework for veridical (truthful) data science, we construct $\textit{SP}$ (Stable and Pred-checked) predictors by applying a prediction-based screening procedure (pred-check in PCS) to select predictive representations, followed by ensembling models trained on each—thereby leveraging representation-level diversity. This approach improves predictive accuracy, out-of-distribution generalization, and uncertainty quantification across a range of model classes. Our SP variant of the recently introduced kernel regression method, Kermut, achieves state-of-the-art performance on the ProteinGym supervised fitness prediction benchmark: it reduces mean squared error by up to 20\% and improves Spearman correlation by up to 10\%, with the largest improvements on splits representing a distribution shift. We further demonstrate that SP predictors yield statistically significant improvements in in-silico protein design tasks. Our results highlight the critical role of representation-level variability in fitness prediction and, more broadly, underscore the need to address instability throughout the entire data science lifecycle to advance protein design.
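The SP construction described in the abstract has two steps: screen candidate representations by held-out predictive performance (the pred-check), then ensemble models trained on the survivors. The sketch below illustrates that recipe with a closed-form ridge regressor and a correlation-based screen; the function names, the `screen_thresh` parameter, and the 80/20 validation split are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def fit_ridge(X, y, lam=1.0):
    # Closed-form ridge regression: w = (X^T X + lam * I)^{-1} X^T y
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def sp_predict(reps_train, y_train, reps_test, screen_thresh=0.2):
    """Hypothetical sketch of an SP (Stable and Pred-checked) predictor.

    reps_train / reps_test: dict mapping representation name -> feature
    matrix (one entry per candidate representation of the same sequences).
    Representations whose held-out correlation falls below `screen_thresh`
    are dropped (the pred-check); predictions from models trained on the
    surviving representations are averaged (representation-level ensembling).
    """
    n = len(y_train)
    split = int(0.8 * n)  # assumed 80/20 train/validation split
    preds = []
    for name, X in reps_train.items():
        w = fit_ridge(X[:split], y_train[:split])
        val_pred = X[split:] @ w
        score = np.corrcoef(val_pred, y_train[split:])[0, 1]
        if score >= screen_thresh:            # pred-check: keep predictive reps
            w_full = fit_ridge(X, y_train)    # refit on all training data
            preds.append(reps_test[name] @ w_full)
    return np.mean(preds, axis=0)             # ensemble over representations
```

In practice the paper applies this idea to stronger base models (e.g. the Kermut kernel regression), but the screening-then-ensembling structure is the same.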
Serve As Reviewer: ~Omer_Ronen1
Submission Number: 16