['6,12c6,12', '< Accurately predicting protein variant effects is crucial for both advancing biological understanding and for engineering and optimizing proteins towards specific traits. Recently, much progress has been made in the field as a result of advances in machine learning-driven modeling [1][2][3], data availability [4,5], and relevant benchmarks [6,7].', '< While prediction accuracy has received considerable attention, the ability to quantify the uncertainties of predictions has been less intensely explored. This is of immediate practical consequence. One of the main purposes of protein variant effect prediction is as an aid for protein engineering and design, to propose promising candidates for subsequent experimental characterization. For this purpose, it is essential that we can quantify, on an instance-to-instance basis, how trustworthy our predictions are. Specifically, in a Bayesian optimization setting, most choices of acquisition function actively rely on predicted uncertainties to guide the optimization, and well-calibrated uncertainties have been shown to correlate with optimization performance [8].', '< Our goal with this paper is to start a discussion on the quality of the estimated uncertainties of supervised protein variant effect prediction. Gaussian Processes (GP) are a standard choice for uncertainty quantification due to the closed form expression of the posterior. We therefore first ask the question whether state-of-the-art performance can be obtained within the GP framework. We propose a composite kernel that comfortably achieves this goal, and subsequently investigate the quality of the uncertainty estimates from such a model. Our results show that while standard approaches like reliability diagrams give the impression of good levels of calibration, the quantification of per-instance uncertainties is more challenging. 
We make our model available as a baseline and encourage the community to place greater emphasis on uncertainty quantification in this important domain. Our contributions can be summarized as follows:', '< • We introduce Kermut, a Gaussian process with a novel composite kernel for modeling mutation similarity, leveraging signals from pretrained sequence and structure models;', '< • We evaluate this model on the comprehensive ProteinGym substitution benchmark and show that it is able to reach state-of-the-art performance in supervised protein variant effect prediction, outperforming recently proposed deep learning methods in this domain;', '< • We provide a thorough calibration analysis and show that while Kermut provides well-calibrated uncertainties overall, the calibratedness of instance-specific uncertainties remains challenging;', '< • We demonstrate that our model can be trained and evaluated orders of magnitude faster and with better out-of-the-box calibration than competing methods.', '---', '> The precise prediction of protein variant effects stands as a cornerstone in both fundamental biological research and the applied domain of protein engineering, enabling the rational design and optimization of proteins for diverse biotechnological and therapeutic applications. The field has witnessed a rapid acceleration in recent years, primarily driven by breakthroughs in machine learning methodologies [1][2][3], the proliferation of large-scale experimental datasets [4,5], and the establishment of robust, standardized benchmarks [6,7].', '> Despite significant advancements in predictive accuracy, the critical capability to quantify the uncertainties associated with these predictions remains an underexplored frontier. This oversight carries immediate and substantial practical implications. 
In protein engineering and design, where the goal is often to identify and prioritize promising candidates for costly experimental validation, the ability to gauge the trustworthiness of each individual prediction is paramount. For instance, in Bayesian optimization, a widely adopted strategy for guiding experimental search, the efficacy of acquisition functions fundamentally relies on accurate uncertainty estimates to efficiently navigate the vast protein fitness landscapes; well-calibrated uncertainties have been demonstrably linked to superior optimization performance [8].', '> This paper addresses the pressing need for high-quality uncertainty quantification in supervised protein variant effect prediction. Gaussian Processes (GPs), with their inherent capacity for providing closed-form expressions of posterior distributions, offer a natural and powerful framework for uncertainty estimation. Our primary objective is to demonstrate that state-of-the-art predictive performance can be achieved within the GP framework. To this end, we introduce Kermut, a novel Gaussian process regression model equipped with a sophisticated composite kernel. Kermut not only achieves state-of-the-art accuracy but also provides robust estimates of uncertainty through its posterior. A rigorous analysis of these uncertainty estimates reveals that while our model exhibits meaningful levels of overall calibration, the challenge of instance-specific uncertainty calibration persists. We release Kermut as a robust, high-performance baseline and advocate for a renewed emphasis on uncertainty quantification within this vital domain. 
Our key contributions are:', '> • We present Kermut, a Gaussian process model featuring a novel composite kernel that effectively integrates signals from pretrained sequence and structure models to capture intricate mutation similarities;', '> • We conduct a comprehensive evaluation of Kermut on the extensive ProteinGym substitution benchmark, demonstrating its ability to achieve state-of-the-art performance in supervised protein variant effect prediction, surpassing the accuracy of several recently proposed deep learning methods;', '> • We provide an in-depth calibration analysis, revealing that Kermut delivers well-calibrated uncertainties at an aggregate level, while highlighting the ongoing challenges in achieving perfectly calibrated instance-specific uncertainties;', '> • We illustrate that Kermut offers substantial computational advantages, enabling training and evaluation orders of magnitude faster than competing deep learning approaches, coupled with superior out-of-the-box calibration properties.', '13a14', '> Gaussian Processes (GPs) have a long-standing history in machine learning, particularly valued for their ability to provide well-calibrated uncertainty estimates alongside predictions [65]. Their application spans diverse fields, including bioinformatics, where they have been used for tasks such as gene expression analysis and drug discovery. In the context of protein modeling, GPs offer a powerful non-parametric framework for capturing complex relationships within protein sequence and structure data. Recent advancements in GP scalability [72][73][74] have made them increasingly viable for larger biological datasets.', '14a16,17', '> Protein sequence and structure modeling has seen a revolution with the advent of deep learning, especially large language models (LLMs) and graph neural networks. 
Pretrained protein language models like ESM-2 [16], ProtTrans [15], and SaProt [17] have demonstrated remarkable capabilities in learning rich, contextual embeddings from vast unannotated protein sequence data. These embeddings capture evolutionary and biochemical signals that are highly predictive of various protein properties. Similarly, structure-based models, such as inverse folding networks like ProteinMPNN [52], leverage known protein structures to predict amino acid sequences, thereby encoding information about local structural environments and their physicochemical constraints. The integration of these advanced representations into downstream predictive models, including kernel methods, offers a promising avenue for improving performance and interpretability in protein variant effect prediction.', '> ', '16,19c19,22', '< Predicting protein function and properties using machine-learning based approaches continues to be an innovative and important area of research.', '< Recently, unsupervised approaches have gained significant momentum where models trained in a self-supervised fashion have shown impressive results for zero-shot estimates of protein fitness and variant effects relative to a reference protein [3,9,10,11].', '< Supervised learning is a crucial method of utilizing experimental data to predict protein fitness. This is particularly valuable when the trait of interest correlates poorly with the evolutionary signals that unsupervised models capture during training or if multiple traits are considered. Supervised protein fitness prediction using machine learning has been explored in detail in [12], where a comprehensive overview can be found. A common strategy is to employ transfer learning via embeddings extracted from self-supervised models [13,14], an approach which increasingly relies on large pretrained language models such as ProtTrans [15], ESM-2 [16], and SaProt [17]. 
In [18], the authors propose to augment a one-hot encoding of the aligned amino acid sequence by concatenating it with a zero-shot score for improved predictions. This was further expanded upon with ProteinNPT [19], where sequences embedded with the MSA Transformer [20] and zero-shot scores were fed to a transformer architecture for state-of-the-art supervised variant effect prediction with generative capabilities.', '< Considerable progress has been made in defining meaningful and comprehensive benchmarks to reliably measure and compare model performance in both unsupervised and supervised protein fitness prediction settings. The FLIP benchmark [6] introduced three supervised prediction tasks ranging from local to global fitness prediction, where each task in turn was divided into clearly defined splits. The supervised benchmarks often view fitness prediction through a particular lens. Where FLIP targeted problems of interest to protein engineering; TAPE [21] evaluated transfer learning abilities; PEER [22] focused on sequence understanding; ATOM3D [23] considered a structure-based approach; FLOP [24] targeted wild type proteins; and ProteinGym focused exclusively on variant effect prediction [11]. The ProteinGym benchmark was recently expanded to encompass more than 200 standardized datasets in both zero-shot and supervised settings, including substitutions, insertions, deletions, and curated clinical datasets [11,7].', '---', '> The prediction of protein function and properties through machine learning has emerged as a highly dynamic and critical area of research.', '> In recent years, unsupervised approaches, particularly models trained in a self-supervised fashion, have demonstrated impressive capabilities in providing zero-shot estimates of protein fitness and variant effects relative to a reference protein [3,9,10,11].', '> Supervised learning is an indispensable methodology for leveraging experimental data to predict protein fitness. 
This approach is especially vital when the specific trait of interest exhibits a weak correlation with the evolutionary signals captured by unsupervised models during their pre-training, or when multiple, diverse traits are under consideration. A detailed exploration of supervised protein fitness prediction using machine learning is presented in [12]. A prevalent strategy involves transfer learning, utilizing embeddings extracted from self-supervised models [13,14]. This methodology increasingly relies on large-scale pretrained protein language models, such as ProtTrans [15], ESM-2 [16], and SaProt [17]. Building on this, [18] introduced an approach to augment one-hot encodings of aligned amino acid sequences by concatenating them with zero-shot scores, leading to enhanced predictions. This concept was further advanced by ProteinNPT [19], which employed sequences embedded with the MSA Transformer [20] and zero-shot scores as input to a transformer architecture, achieving state-of-the-art supervised variant effect prediction with additional generative capabilities.', '> Substantial progress has also been made in establishing meaningful and comprehensive benchmarks, crucial for reliably measuring and comparing model performance in both unsupervised and supervised protein fitness prediction contexts. The FLIP benchmark [6] introduced three distinct supervised prediction tasks, ranging from local to global fitness prediction, each meticulously partitioned into clearly defined data splits. These supervised benchmarks often approach fitness prediction from specific perspectives. For instance, FLIP addressed problems pertinent to protein engineering; TAPE [21] assessed transfer learning proficiencies; PEER [22] concentrated on sequence understanding; ATOM3D [23] adopted a structure-based methodology; FLOP [24] focused on wild-type proteins; and ProteinGym [11] was exclusively dedicated to variant effect prediction. 
The ProteinGym benchmark has recently been significantly expanded to include over 200 standardized datasets in both zero-shot and supervised settings, encompassing substitutions, insertions, deletions, and meticulously curated clinical datasets [11,7].', '547d549', '< ']
