Keywords: protein engineering, uncertainty quantification
TL;DR: we improve protein fitness prediction and uncertainty quantification through a simple and intuitive procedure
Abstract: We improve protein fitness prediction by addressing an often-overlooked source
of instability in machine learning models: the choice of data representation.
Guided by the Predictability–Computability–Stability (PCS) framework for
veridical (truthful) data science, we construct $\textit{Stable}$ predictors by
applying a prediction-based screening procedure (pred-check in PCS) to select
predictive representations and then ensembling models trained on each selected representation (sketched below), thereby leveraging representation-level diversity. This approach
improves predictive accuracy, out-of-distribution generalization, and uncertainty
quantification across a range of model classes. Our $\textit{Stable}$ variant of the recently introduced
kernel regression method, Kermut, achieves state-of-the-art performance on the
ProteinGym supervised fitness prediction benchmark: it reduces mean squared error
by up to 20\% and improves Spearman correlation by up to 10\%, with the largest improvements
on splits representing a distribution shift. We further demonstrate that $\textit{Stable}$ predictors yield statistically significant improvements in in silico protein
design tasks. Our results highlight the critical role of representation-level variability in fitness prediction and, more broadly, underscore the need to address instability throughout the entire data science lifecycle to advance robust protein design.
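The following sketch is illustrative only and is not the authors' released code: it shows one way the procedure described in the abstract could look in practice. A pred-check screen keeps only those representations whose fitted models reach a chosen validation score, and predictions from models trained on each retained representation are averaged, with the ensemble spread serving as a simple uncertainty estimate. The model class (ridge regression), the Spearman-correlation threshold, and all names are assumptions, not the paper's actual implementation.

```python
# Minimal sketch (not the authors' code) of the Stable-predictor idea:
# (1) pred-check: keep only representations whose fitted models are
#     sufficiently predictive on held-out data;
# (2) ensemble: average predictions of models trained on each retained
#     representation. Model class, metric, and threshold are placeholders.
import numpy as np
from scipy.stats import spearmanr
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split


def fit_stable_predictor(representations, y, threshold=0.2, seed=0):
    """representations: dict name -> (n_samples, d) feature matrix; y: fitness labels."""
    kept = []
    for name, X in representations.items():
        X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=seed)
        model = Ridge().fit(X_tr, y_tr)
        rho, _ = spearmanr(y_val, model.predict(X_val))
        if rho >= threshold:                        # pred-check: screen out weak representations
            kept.append((name, Ridge().fit(X, y)))  # refit on all data for the ensemble
    if not kept:
        raise ValueError("No representation passed the pred-check screen.")
    return kept


def predict_stable(kept_models, representations_test):
    """Average predictions across models trained on each retained representation."""
    preds = [m.predict(representations_test[name]) for name, m in kept_models]
    return np.mean(preds, axis=0), np.std(preds, axis=0)  # mean prediction and ensemble spread


# Toy usage with random features standing in for protein embeddings (e.g. one-hot, PLM embeddings).
rng = np.random.default_rng(0)
y = rng.normal(size=200)
reps = {"embed_a": rng.normal(size=(200, 16)), "embed_b": rng.normal(size=(200, 32))}
models = fit_stable_predictor(reps, y, threshold=-1.0)  # permissive threshold for the toy data
mean_pred, spread = predict_stable(models, reps)
```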
Submission Number: 35