Stabilizing protein fitness predictors via the PCS framework

ICML 2025 Workshop FM4LS Submission 35 Authors

Published: 12 Jul 2025, Last Modified: 12 Jul 2025, FM4LS 2025, CC BY 4.0
Keywords: protein engineering, uncertainty quantification
TL;DR: We improve protein fitness prediction and uncertainty quantification through a simple and intuitive procedure.
Abstract: We improve protein fitness prediction by addressing an often-overlooked source of instability in machine learning models: the choice of data representation. Guided by the Predictability–Computability–Stability (PCS) framework for veridical (truthful) data science, we construct $\textit{Stable}$ predictors by applying a prediction-based screening procedure (pred-check in PCS) to select predictive representations, followed by ensembling models trained on each selected representation, thereby leveraging representation-level diversity. This approach improves predictive accuracy, out-of-distribution generalization, and uncertainty quantification across a range of model classes. Our $\textit{Stable}$ variant of the recently introduced kernel regression method, Kermut, achieves state-of-the-art performance on the ProteinGym supervised fitness prediction benchmark: it reduces mean squared error by up to 20\% and improves Spearman correlation by up to 10\%, with the largest improvements on splits representing a distribution shift. We further demonstrate that $\textit{Stable}$ predictors yield statistically significant improvements in in-silico protein design tasks. Our results highlight the critical role of representation-level variability in fitness prediction and, more broadly, underscore the need to address instability throughout the entire data science lifecycle to advance robust protein design.
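The screen-then-ensemble idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the ridge regressor, the validation-Spearman threshold of 0.5, the use of the ensemble standard deviation as an uncertainty proxy, the synthetic data, and all names (`stable_predict`, `embedding_a`, `embedding_b`, `noise`) are assumptions made here for illustration only.

```python
import numpy as np

def fit_ridge(X, y, lam=1.0):
    """Closed-form ridge regression with a bias column (assumed base model)."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    A = Xb.T @ Xb + lam * np.eye(Xb.shape[1])
    return np.linalg.solve(A, Xb.T @ y)

def predict_ridge(w, X):
    return np.hstack([X, np.ones((X.shape[0], 1))]) @ w

def spearman(a, b):
    """Spearman correlation as Pearson correlation of ranks (ties not handled)."""
    ra = np.argsort(np.argsort(a))
    rb = np.argsort(np.argsort(b))
    return np.corrcoef(ra, rb)[0, 1]

def stable_predict(train, y_train, val, y_val, test, threshold=0.5):
    """Screen representations by validation Spearman, then average the survivors.

    train/val/test: dicts mapping a representation name to a feature matrix.
    The threshold is a hypothetical choice, not a value from the paper.
    """
    kept, preds = [], []
    for name in train:
        w = fit_ridge(train[name], y_train)
        score = spearman(predict_ridge(w, val[name]), y_val)
        if score >= threshold:  # prediction-based screening step
            kept.append(name)
            preds.append(predict_ridge(w, test[name]))
    if not kept:
        raise ValueError("no representation passed the screen")
    preds = np.stack(preds)
    # Mean is the ensemble prediction; std across members is a crude
    # uncertainty proxy assumed here for illustration.
    return kept, preds.mean(axis=0), preds.std(axis=0)

# Synthetic demo: two informative representations and one pure-noise one.
rng = np.random.default_rng(0)
z = rng.normal(size=(400, 5))                      # latent sequence factors
y = z @ rng.normal(size=5) + 0.1 * rng.normal(size=400)
reps = {
    "embedding_a": z @ rng.normal(size=(5, 8)) + 0.1 * rng.normal(size=(400, 8)),
    "embedding_b": z @ rng.normal(size=(5, 8)) + 0.1 * rng.normal(size=(400, 8)),
    "noise": rng.normal(size=(400, 8)),
}
split = lambda lo, hi: {k: v[lo:hi] for k, v in reps.items()}
kept, mean_pred, std_pred = stable_predict(
    split(0, 200), y[:200], split(200, 300), y[200:300], split(300, 400)
)
test_spearman = spearman(mean_pred, y[300:])
```

In this sketch the uninformative representation is expected to fail the screen, so only models built on predictive representations contribute to the ensemble; swapping in a different base model or screening metric would follow the same pattern.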
Submission Number: 35