Track: AI for Science
Keywords: protein fitness prediction, bayesian optimization, uncertainty quantification
TL;DR: We show that leveraging multiple reasonable representations of the same protein sequence significantly improves both predictive performance and uncertainty quantification.
Abstract: We improve protein fitness prediction by addressing an often-overlooked source
of instability in machine learning models: the choice of data representation.
Guided by the Predictability–Computability–Stability (PCS) framework for
veridical (truthful) data science, we construct $\textit{SP}$ (Stable and Pred-checked) predictors by
applying a prediction-based screening procedure (pred-check in PCS) to select
predictive representations, followed by ensembling models trained on each—thereby leveraging representation-level diversity. This approach
improves predictive accuracy, out-of-distribution generalization, and uncertainty
quantification across a range of model classes. Our SP variant of the recently introduced
kernel regression method, Kermut, achieves state-of-the-art performance on the
ProteinGym supervised fitness prediction benchmark: it reduces mean squared error
by up to 20\% and improves Spearman correlation by up to 10\%, with the largest improvements
on splits representing a distribution shift. We further demonstrate that SP predictors yield statistically significant improvements in in silico protein
design tasks. Our results highlight the critical role of representation-level variability in fitness prediction and, more broadly, underscore the need to address instability throughout the entire data science lifecycle to advance protein design.
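To make the SP construction described in the abstract concrete, here is a minimal sketch of the two-step recipe (pred-check screening of representations, then ensembling one model per screened representation). The data, model class, threshold, and function names are illustrative assumptions, not the submission's implementation.

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge
from sklearn.model_selection import cross_val_score

# Hypothetical setup: each "representation" maps the same protein sequences to a
# different feature matrix (e.g. one-hot encodings, language-model embeddings).
rng = np.random.default_rng(0)
n = 200
representations = {
    "one_hot": rng.normal(size=(n, 40)),
    "embedding_a": rng.normal(size=(n, 64)),
    "embedding_b": rng.normal(size=(n, 32)),
}
# Synthetic fitness labels, correlated with one representation for illustration.
y = representations["embedding_a"][:, 0] + 0.1 * rng.normal(size=n)

# Pred-check: keep only representations whose cross-validated predictive score
# clears a screening threshold (metric and threshold are illustrative).
threshold = 0.0
screened = {}
for name, X in representations.items():
    score = cross_val_score(KernelRidge(alpha=1.0), X, y, cv=5).mean()
    if score > threshold:
        screened[name] = X

# Ensemble: fit one model per screened representation and average predictions.
models = {name: KernelRidge(alpha=1.0).fit(X, y) for name, X in screened.items()}

def sp_predict(features_by_rep):
    """Equal-weight average of per-representation predictions."""
    preds = [models[name].predict(features_by_rep[name]) for name in models]
    return np.mean(preds, axis=0)
```

The spread of the per-representation predictions averaged in `sp_predict` is also what one would inspect for representation-level uncertainty, in the spirit of the uncertainty-quantification gains reported above.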
Serve As Reviewer: ~Omer_Ronen1
Submission Number: 16