Title: Kermut: Composite kernel regression for protein variant effects

Abstract: Reliable prediction of protein variant effects is crucial for both protein optimization and for advancing biological understanding. For practical use in protein engineering, it is important that we can also provide reliable uncertainty estimates for our predictions, and while prediction accuracy has seen much progress in recent years, uncertainty metrics are rarely reported. We here provide a Gaussian process regression model, Kermut, with a novel composite kernel for modeling mutation similarity, which obtains state-of-the-art performance for supervised protein variant effect prediction while also offering estimates of uncertainty through its posterior. An analysis of the quality of the uncertainty estimates demonstrates that our model provides meaningful levels of overall calibration, but that instance-specific uncertainty calibration remains more challenging.

Section: Introduction
The precise prediction of protein variant effects stands as a cornerstone in both fundamental biological research and the applied domain of protein engineering, enabling the rational design and optimization of proteins for diverse biotechnological and therapeutic applications. The field has witnessed a rapid acceleration in recent years, primarily driven by breakthroughs in machine learning methodologies [1][2][3], the proliferation of large-scale experimental datasets [4,5], and the establishment of robust, standardized benchmarks [6,7].
Despite significant advancements in predictive accuracy, the critical capability to quantify the uncertainties associated with these predictions remains an underexplored frontier. This oversight carries immediate and substantial practical implications. In protein engineering and design, where the goal is often to identify and prioritize promising candidates for costly experimental validation, the ability to gauge the trustworthiness of each individual prediction is paramount. For instance, in Bayesian optimization, a widely adopted strategy for guiding experimental search, the efficacy of acquisition functions fundamentally relies on accurate uncertainty estimates to efficiently navigate the vast protein fitness landscapes; well-calibrated uncertainties have been demonstrably linked to superior optimization performance [8].
This paper addresses the pressing need for high-quality uncertainty quantification in supervised protein variant effect prediction. Gaussian Processes (GPs), with their inherent capacity for providing closed-form expressions of posterior distributions, offer a natural and powerful framework for uncertainty estimation. Our primary objective is to demonstrate that state-of-the-art predictive performance can be achieved within the GP framework. To this end, we introduce Kermut, a novel Gaussian process regression model equipped with a sophisticated composite kernel. Kermut not only achieves state-of-the-art accuracy but also provides robust estimates of uncertainty through its posterior. A rigorous analysis of these uncertainty estimates reveals that while our model exhibits meaningful levels of overall calibration, the challenge of instance-specific uncertainty calibration persists. We release Kermut as a robust, high-performance baseline and advocate for a renewed emphasis on uncertainty quantification within this vital domain. Our key contributions are:
• We present Kermut, a Gaussian process model featuring a novel composite kernel that effectively integrates signals from pretrained sequence and structure models to capture intricate mutation similarities;
• We conduct a comprehensive evaluation of Kermut on the extensive ProteinGym substitution benchmark, demonstrating its ability to achieve state-of-the-art performance in supervised protein variant effect prediction, surpassing the accuracy of several recently proposed deep learning methods;
• We provide an in-depth calibration analysis, revealing that Kermut delivers well-calibrated uncertainties at an aggregate level, while highlighting the ongoing challenges in achieving perfectly calibrated instance-specific uncertainties;
• We illustrate that Kermut offers substantial computational advantages, enabling training and evaluation orders of magnitude faster than competing deep learning approaches, coupled with superior out-of-the-box calibration properties.
Section: Related work
Gaussian Processes (GPs) have a long-standing history in machine learning, particularly valued for their ability to provide well-calibrated uncertainty estimates alongside predictions [65]. Their application spans diverse fields, including bioinformatics, where they have been used for tasks such as gene expression analysis and drug discovery. In the context of protein modeling, GPs offer a powerful non-parametric framework for capturing complex relationships within protein sequence and structure data. Recent advancements in GP scalability [72][73][74] have made them increasingly viable for larger biological datasets.

Protein sequence and structure modeling has seen a revolution with the advent of deep learning, especially large language models (LLMs) and graph neural networks. Pretrained protein language models like ESM-2 [16], ProtTrans [15], and SaProt [17] have demonstrated remarkable capabilities in learning rich, contextual embeddings from vast unannotated protein sequence data. These embeddings capture evolutionary and biochemical signals that are highly predictive of various protein properties. Similarly, structure-based models, such as inverse folding networks like ProteinMPNN [52], leverage known protein structures to predict amino acid sequences, thereby encoding information about local structural environments and their physicochemical constraints. The integration of these advanced representations into downstream predictive models, including kernel methods, offers a promising avenue for improving performance and interpretability in protein variant effect prediction.

Section: Protein property prediction
The prediction of protein function and properties through machine learning has emerged as a highly dynamic and critical area of research.
In recent years, unsupervised approaches, particularly models trained in a self-supervised fashion, have demonstrated impressive capabilities in providing zero-shot estimates of protein fitness and variant effects relative to a reference protein [3,9,10,11].
Supervised learning is an indispensable methodology for leveraging experimental data to predict protein fitness. This approach is especially vital when the specific trait of interest exhibits a weak correlation with the evolutionary signals captured by unsupervised models during their pre-training, or when multiple, diverse traits are under consideration. A detailed exploration of supervised protein fitness prediction using machine learning is presented in [12]. A prevalent strategy involves transfer learning, utilizing embeddings extracted from self-supervised models [13,14]. This methodology increasingly relies on large-scale pretrained protein language models, such as ProtTrans [15], ESM-2 [16], and SaProt [17]. Building on this, [18] introduced an approach to augment one-hot encodings of aligned amino acid sequences by concatenating them with zero-shot scores, leading to enhanced predictions. This concept was further advanced by ProteinNPT [19], which employed sequences embedded with the MSA Transformer [20] and zero-shot scores as input to a transformer architecture, achieving state-of-the-art supervised variant effect prediction with additional generative capabilities.
Substantial progress has also been made in establishing meaningful and comprehensive benchmarks, crucial for reliably measuring and comparing model performance in both unsupervised and supervised protein fitness prediction contexts. The FLIP benchmark [6] introduced three distinct supervised prediction tasks, ranging from local to global fitness prediction, each meticulously partitioned into clearly defined data splits. These supervised benchmarks often approach fitness prediction from specific perspectives. For instance, FLIP addressed problems pertinent to protein engineering; TAPE [21] assessed transfer learning proficiencies; PEER [22] concentrated on sequence understanding; ATOM3D [23] adopted a structure-based methodology; FLOP [24] focused on wild-type proteins; and ProteinGym [11] was exclusively dedicated to variant effect prediction. The ProteinGym benchmark has recently been significantly expanded to include over 200 standardized datasets in both zero-shot and supervised settings, encompassing substitutions, insertions, deletions, and meticulously curated clinical datasets [11,7].

Section: Kernel methods for protein sequences
Kernel methods have seen much use for protein modeling and protein property prediction. Sequence-based string kernels operating directly on the protein amino acid sequences are one such example, where, e.g., matching k-mers at different values of k quantify covariance. This has been used with support vector machines to predict protein homology [25,26]. Another example is sub-sequence string kernels, which in [27] are used in a Gaussian process for a Bayesian optimization procedure. In [28], string kernels were combined with predicted physicochemical properties to improve accuracy in the prediction of MHC-binding peptides and in protein fold classification. In [29], a kernel leveraging the tertiary structure for a protein family represented as a residue-residue contact map was used to predict various protein properties such as enzymatic activity and binding affinity. In [30], Gaussian process regression (GPR) was used to successfully identify promising enzyme sequences, which were subsequently synthesized and showed increased activity. In [31], the authors provide a comprehensive study of kernels on biological sequences, which includes a thorough review of the literature as well as theoretical, simulated, and in-silico results.
Most similar to our work is mGPfusion [32], in which a weighted decomposition kernel was defined which operated on the local tertiary protein structure in conjunction with a number of substitution matrices. Simulated stability data for all possible single mutations were obtained via Rosetta [33] and then fused with experimental data for accurate ∆∆G predictions of single- and multi-mutant variants via GPR, thus incorporating sequence, structure, and a biophysical simulator. In contrast to our approach, the mGPfusion model does not leverage pretrained models, but instead relies on substitution matrices for its evolutionary signal. A more recent example of kernel-based methods yielding highly competitive results is xGPR [34], in which Gaussian processes with custom kernels show high performance when trained on protein language model embeddings, similarly to the sequence kernel in our work (see Section 3.3). Where xGPR introduces a set of novel random feature-approximated kernels with linear scaling, Kermut instead uses the squared exponential kernel for sequence modeling while additionally modeling local structural environments. The models in xGPR were shown to provide both high accuracy and well-calibrated uncertainty estimation on the FLIP and TAPE benchmarks.

Section: Uncertainty quantification and calibration
Uncertainty quantification (UQ) for protein property prediction continues to be a promising area of research with immediate practical consequences. In [35], residual networks were used to model both epistemic and aleatoric uncertainty for peptide selection. In [36], GPR on MLP-residuals from biLSTM embeddings was used to successfully guide in-silico experimental design of kinase binders and fluorescent proteins. The authors of [37] augmented a Bayesian neural network by placing biophysical priors over the mean function by directly using Rosetta energy scores, whereby the model would revert to the biophysical prior when the epistemic uncertainty was large. This was used to predict fluorescence, binding, and solubility for drug-like molecules. In [38], state-of-the-art performance on protein-protein interactions was achieved by using a spectral-normalized neural Gaussian process [39] with an uncertainty-aware transformer-based architecture working on ESM-2 embeddings.
In [40], a framework for evaluating the epistemic uncertainty of deep learning models using confidence interval-based metrics was introduced, while [41] conducted a thorough analysis of uncertainty quantification methods for molecular property prediction. Here, the importance of supplementing confidence-based calibration with error-based calibration as introduced in [42] was highlighted, whereby the predicted uncertainties are connected directly to the expected error for a more nuanced calibration analysis. We evaluate our model using confidence-based calibration as well as error-based calibration following the guidelines in [41]. In [43], the authors conducted a systematic comparison of UQ methods on molecular property regression tasks, while [44] investigated calibratedness of regression models for material property prediction. In [45], the above approaches were expanded to protein property prediction tasks where the FLIP [6] benchmark was examined, while [46] benchmarked a number of UQ methods for molecular representation models. In [47], the authors developed an active learning approach for partial charge prediction of metal-organic frameworks via Monte Carlo dropout [48] while achieving decent calibration. In [49], a systematic analysis of protein regression models was conducted where well-calibrated uncertainties were observed for a range of input representations.

Section: Local structural environments
Much work has been done to solve the inverse-folding problem, where the most probable amino acid sequence to fold into a given protein backbone structure is predicted [50][51][52][53][54][55][56]. Inverse-folding models are trained on large datasets of protein structures and model the physicochemical and evolutionary constraints of sites in a protein conditioned on their structural contexts. These will form the basis of the structural featurization in our work. Local structural environments have previously been used for protein modeling. In [57], a 3D CNN was used to predict amino acid preferences giving rise to novel substitution matrices. In [58] and [59], surface-level fingerprinting given structural environments was used to model protein-protein interaction sites and for de novo design of protein interactions. In [60], chemical microenvironments were used to identify potentially beneficial mutations, while [61] used a similar approach, where they investigated the volume of the local environments and observed that the first contact shell delivered the primary signal thus emphasizing the importance of locality. In [62], a composition Hellinger distance metric based on the chemical composition of local residue environments was developed and used for a range of structure-related experiments. Recently, local structural environments were used to model mutational preferences for protein engineering tasks [63], however not in a Gaussian process framework as we propose here.

Section: Methods


Section: Preliminaries
We want to predict the outcome of an assay measured on a protein, represented by its amino acid sequence $x$ of length $L$. We will assume that we have a dataset of $N$ such sequences available, and that these are of equal length and structure, such that we can meaningfully refer to the effect at specific positions (sites) in the protein. In protein engineering, we typically consider modifications relative to an initial wild type sequence, $x_{\mathrm{WT}}$. We will assume that the 3D structure for the initial sequence, $s$, is available (either experimentally determined or provided by a structure predictor like AlphaFold [64]). Lastly, for a variant $x$ with mutations at sites $M \subseteq \{1, \ldots, L\}$, let $x^m$ denote the variant which has the same mutation as $x$ at site $m$ for $m \in M$ and otherwise is equal to $x_{\mathrm{WT}}$.

Section: Gaussian processes
To predict protein variant effects, we rely on Gaussian process regression, which we shall now briefly introduce. For a comprehensive overview, see [65], which this section is based on.
Let $X$ and $Y$ be two random variables on the measurable spaces $\mathcal{X}$ and $\mathbb{R}$, respectively, and let $X = x_1, \ldots, x_N$ and $y = y_1, \ldots, y_N$ be realizations of these random variables. We assume that $y_i = g(x_i) + \epsilon$, where $g$ represents some unknown function and $\epsilon \sim \mathcal{N}(0, \sigma_\epsilon^2)$ accounts for random noise. Our objective is to model the distribution capturing our belief about $g$.
Gaussian processes are stochastic processes providing a powerful framework for modeling distributions over functions. The Gaussian process framework allows us to not only make predictions but also to quantify the uncertainty associated with each prediction. A Gaussian process is entirely specified by its mean and covariance functions, $m(x)$ and $k(x, x')$. We assume that the covariance matrix, $K$, of the outputs $\{y_1, \ldots, y_N\}$ can be parameterized by a function of their inputs $\{x_1, \ldots, x_N\}$. The parameterization is defined by the kernel, $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$, yielding $K$ such that $K_{ij} = k(x_i, x_j)$. For $k$ to be a valid kernel, $K$ needs to be symmetric and positive semidefinite.
Let $f$ represent our approximation of $g$, $f(x) \approx g(x)$. Given a training set $\mathcal{D} = (X, y)$ and a number of test points $X_*$, the function $f_*$ predicts the values of $y_*$ at $X_*$. Using rules of normal distributions, we derive the posterior distribution $p(f_* \mid X_*, \mathcal{D})$, providing both a prediction of $y_*$ at $X_*$ and a confidence measure, often expressed as $\pm 2\sigma$, where $\sigma$ is the posterior standard deviation at a test point $x_*$.
The kernel function often contains hyperparameters, $\eta$. These can be optimized by maximizing the marginal likelihood, $p(y \mid X, \eta)$, which is known as type II maximum likelihood.
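The closed-form posterior and the marginal likelihood described above can be illustrated with a minimal NumPy sketch. It assumes a zero-mean prior and a plain squared exponential kernel on real-valued inputs; the function names, the default noise variance, and the length scale are illustrative choices and not part of Kermut.

```python
import numpy as np

def se_kernel(A, B, lengthscale=1.0):
    """Squared exponential kernel between the rows of A and the rows of B."""
    sq_dists = (np.sum(A**2, axis=1)[:, None]
                + np.sum(B**2, axis=1)[None, :]
                - 2.0 * A @ B.T)
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * lengthscale**2))

def gp_posterior(X, y, X_star, noise_var=0.1, lengthscale=1.0):
    """Posterior mean and variance of f* under a zero-mean GP prior (assumption)."""
    K = se_kernel(X, X, lengthscale) + noise_var * np.eye(len(X))
    K_s = se_kernel(X, X_star, lengthscale)
    K_ss = se_kernel(X_star, X_star, lengthscale)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))  # K^{-1} y
    v = np.linalg.solve(L, K_s)
    mean = K_s.T @ alpha
    var = np.diag(K_ss) - np.sum(v**2, axis=0)
    return mean, var

def log_marginal_likelihood(X, y, noise_var=0.1, lengthscale=1.0):
    """log p(y | X, eta); maximizing this over eta is type II maximum likelihood."""
    K = se_kernel(X, X, lengthscale) + noise_var * np.eye(len(X))
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return (-0.5 * y @ alpha
            - np.sum(np.log(np.diag(L)))
            - 0.5 * len(X) * np.log(2.0 * np.pi))

# Toy data: three training inputs, one test point near the data and one far away.
X_train = np.array([[0.0], [1.0], [2.0]])
y_train = np.sin(X_train[:, 0])
mean, var = gp_posterior(X_train, y_train, np.array([[0.0], [10.0]]))
```

Far from the training data the posterior reverts to the prior (mean near zero, variance near one), which is the behaviour that makes the posterior variance usable as an uncertainty estimate.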

Section: Kermut
It has been shown that local structural dependencies are useful for determining mutation preferences [57,60,61,63]. We therefore hypothesize that a composite kernel with components incorporating information about the local structural environments of residues will be able to model protein variant effects. To this end, we define a structure kernel, $k_{\mathrm{struct}}$, which models mutation similarity given the local environments of mutated sites. A schematic of how the structure kernel models covariances can be seen in Figure 1. In the following we shall define $k^1_{\mathrm{struct}}$, a structure kernel that operates on single-mutant variants. Subsequently, we shall extend it to multi-mutant variants, resulting in the structure kernel, $k_{\mathrm{struct}}$.
We hypothesize that for a given site in a protein, the distribution over amino acids given by a structure-conditioned inverse folding model will reflect the effect of a mutation at that site. We consider such an amino acid distribution a representation of the local environment for that site as it reflects the various physicochemical and evolutionary constraints that the site is subject to. We thus presume that two sites with similar local environments will behave similarly if mutated. For instance, mutations at buried sites in the hydrophobic core of the protein will generally correlate more with each other than with surface-level mutations.
For single-mutant variants, we quantify site similarity using the Hellinger kernel [66]
$$k_H(x, x') = \exp\left(-\gamma_1 d_H\left(f_{\mathrm{IF}}(x), f_{\mathrm{IF}}(x')\right)\right), \quad \gamma_1 > 0,$$
where $d_H$ is the Hellinger distance (see Appendix B.1). The function $f_{\mathrm{IF}} : \mathcal{X}_1 \to [0, 1]^{20}$ takes a single-mutant sequence, $x$, as input and returns a probability distribution over the 20 naturally occurring amino acids at the mutated site in $x$ given by the inverse folding model. The Hellinger kernel will assign maximum covariance when two sites are identical. This however means that $k_H$ is incapable of distinguishing between different mutations at the same site, since $d_H(f_{\mathrm{IF}}(x), f_{\mathrm{IF}}(x')) = 0$, and hence $k_H(x, x') = 1$, whenever $x$ and $x'$ are mutated at the same site.
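As a minimal sketch of the Hellinger kernel, the snippet below assumes the standard definition of the Hellinger distance between discrete distributions, $d_H(p, q) = \frac{1}{\sqrt{2}} \lVert \sqrt{p} - \sqrt{q} \rVert_2$ (the paper's exact definition is given in its Appendix B.1, which is not reproduced here). The two toy distributions are hypothetical stand-ins for inverse folding outputs.

```python
import math

def hellinger_distance(p, q):
    """Hellinger distance between two discrete distributions (standard definition)."""
    return math.sqrt(0.5 * sum((math.sqrt(a) - math.sqrt(b)) ** 2
                               for a, b in zip(p, q)))

def hellinger_kernel(p, q, gamma1=1.0):
    """k_H = exp(-gamma1 * d_H(p, q)); equals 1 when the two distributions coincide."""
    return math.exp(-gamma1 * hellinger_distance(p, q))

# Hypothetical amino acid distributions at two sites (20 entries each).
site_a = [0.5, 0.3, 0.2] + [0.0] * 17   # strongly constrained site
site_b = [1.0 / 20] * 20                # unconstrained site

same_site = hellinger_kernel(site_a, site_a)    # maximum covariance
different = hellinger_kernel(site_a, site_b)    # strictly smaller
```

Because the kernel only sees the site-level distribution, two different mutations at the same site receive the same maximal covariance, which motivates the additional mutation-specific component introduced next.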
To increase flexibility and to allow intra-site comparisons, we introduce a kernel operating on the specific mutation likelihoods. We hypothesize that two variants with mutations on sites that are close in terms of the Hellinger distance will correlate further if the log-probabilities of the specific amino acids on the mutated sites are similar (i.e., the probability of the amino acid that we mutate to is similar at the two sites). We incorporate this by defining
$$k_p(x, x') = k_{\exp}\left(f_{\mathrm{IF1}}(x), f_{\mathrm{IF1}}(x')\right) = \exp\left(-\gamma_2 \lVert f_{\mathrm{IF1}}(x) - f_{\mathrm{IF1}}(x') \rVert\right),$$
where $f_{\mathrm{IF1}} : \mathcal{X}_1 \to (-\infty, 0]$ takes a single-mutant sequence, $x$, as input and returns the log-probability (given by an inverse folding model) of the observed mutation, and where $k_{\exp}$ is the exponential kernel.
Finally, we hypothesize that the effects of two mutations correlate further if the sites are close in physical space. Hence, we multiply the kernel with an exponential kernel on the Euclidean distance between sites: $k_d(x, x') = \exp(-\gamma_3 d_e(s_i, s_j))$, where $s_i$ and $s_j$ denote the 3D coordinates of the sites mutated in $x$ and $x'$, respectively. Thereby, the closer two sites are physically, the more similar, and thus comparable, their local environments will be.
Taking the product of these kernel components, we get the following kernel for single-mutant variants, which assigns high covariance when two single mutant variants have mutations that have similar environments, are physically close, and have similar mutation likelihoods:
$$k^1_{\mathrm{struct}}(x, x') = \lambda\, k_H(x, x')\, k_p(x, x')\, k_d(x, x'), \tag{1}$$
where the kernel has been scaled by a positive scalar, $\lambda > 0$.
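The product in Equation (1) can be sketched as follows. The `SingleMutant` container, its field names, and the default hyperparameter values are hypothetical and introduced only for illustration; the Hellinger distance uses the standard definition, which is an assumption about Appendix B.1.

```python
import math
from dataclasses import dataclass

@dataclass(frozen=True)
class SingleMutant:
    site_probs: tuple   # inverse folding distribution at the mutated site
    log_prob: float     # log-probability of the introduced amino acid
    coords: tuple       # 3D coordinates of the mutated site

def k_hellinger(a, b, gamma1):
    d = math.sqrt(0.5 * sum((math.sqrt(p) - math.sqrt(q)) ** 2
                            for p, q in zip(a.site_probs, b.site_probs)))
    return math.exp(-gamma1 * d)

def k_mutation_prob(a, b, gamma2):
    return math.exp(-gamma2 * abs(a.log_prob - b.log_prob))

def k_distance(a, b, gamma3):
    return math.exp(-gamma3 * math.dist(a.coords, b.coords))

def k1_struct(a, b, lam=1.0, gamma1=1.0, gamma2=1.0, gamma3=1.0):
    """Single-mutant structure kernel: scaled product of the three components."""
    return (lam * k_hellinger(a, b, gamma1)
            * k_mutation_prob(a, b, gamma2)
            * k_distance(a, b, gamma3))

# Two different mutations at the same (toy) site: same environment and position,
# but different mutation log-probabilities.
probs = (0.7, 0.2, 0.1)
mut_1 = SingleMutant(probs, math.log(0.2), (0.0, 0.0, 0.0))
mut_2 = SingleMutant(probs, math.log(0.1), (0.0, 0.0, 0.0))
```

For `mut_1` and `mut_2`, the Hellinger and distance components both equal one, so the covariance is determined entirely by the mutation probability component. This is the intra-site flexibility that the Hellinger kernel alone lacks.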
In [63], the authors showed that a simple linear model operating on one-hot encoded mutations is sufficient to accurately predict mutation effects given sufficient data. Thus, we generalize the kernel to multiple mutations by summing over all pairs of sites differing at $x$ and $x'$:
$$k_{\mathrm{struct}}(x, x') = \sum_{i \in M} \sum_{j \in M'} k^1_{\mathrm{struct}}(x^i, x'^j). \tag{2}$$
The structure kernel models mutations linearly and cannot capture epistatic effects. We propose to add epistatic signals through a sequence kernel. Drawing on the rich literature for modeling protein sequences with Gaussian processes on embeddings [34,45,49,67], we use a squared exponential kernel which operates on sequence embeddings from a pretrained model. We use the 650M parameter ESM-2 protein language model [16], and perform mean-pooling across the length dimension, yielding $z = f_1(x)$, where $f_1$ produces mean-pooled embeddings, $z \in \mathbb{R}^{1280}$, of sequence $x$. We thus model the covariance between these representations as
$$k_{\mathrm{seq}}(x, x') = k_{\mathrm{SE}}(f_1(x), f_1(x')) = k_{\mathrm{SE}}(z, z') = \exp\left(-\frac{\lVert z - z' \rVert_2^2}{2\sigma^2}\right). \tag{3}$$
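The multi-mutant structure kernel is a double sum of a single-mutant kernel over the mutated sites of the two variants. The sketch below illustrates this with a toy single-mutant kernel on `(site, amino_acid)` pairs; this toy kernel is a hypothetical stand-in and not the single-mutant structure kernel of Equation (1).

```python
import math

def toy_k1(a, b):
    """Toy single-mutant kernel on (site, amino_acid) pairs (illustrative stand-in)."""
    site_a, aa_a = a
    site_b, aa_b = b
    site_term = math.exp(-abs(site_a - site_b))   # nearby sites covary more
    aa_term = 1.0 if aa_a == aa_b else 0.5        # same substitution covaries more
    return site_term * aa_term

def k_struct(x, x_prime, k1=toy_k1):
    """Sum the single-mutant kernel over all pairs of mutated sites."""
    return sum(k1(xi, xj) for xi in x for xj in x_prime)

# A variant is represented as the list of its single-mutant components.
single_a = [(10, "A")]
single_b = [(12, "V")]
double_ab = [(10, "A"), (12, "V")]
query = [(11, "A")]
```

Under this construction the covariance of a double mutant with any other variant is the sum of the covariances of its constituent single mutants, which is why the structure kernel on its own cannot express epistasis.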
We choose to add and weigh the structure and sequence kernels resulting in our final kernel formulation, whereby the model can leverage either structure or sequence similarities, depending on the presence and strength of each signal as determined through hyperparameter optimization:
$$k(x, x') = \pi\, k_{\mathrm{struct}}(x, x') + (1 - \pi)\, k_{\mathrm{seq}}(x, x'). \tag{4}$$
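A minimal NumPy sketch of the weighted sum in Equation (4), operating on precomputed Gram matrices. It assumes that the weight $\pi$ lies in $[0, 1]$; the random embeddings and the toy structure Gram matrix are hypothetical stand-ins for ESM-2 embeddings and $k_{\mathrm{struct}}$.

```python
import numpy as np

def k_seq(Z, Z_prime, sigma=1.0):
    """Squared exponential kernel on mean-pooled embeddings, as in Equation (3)."""
    sq_dists = ((Z[:, None, :] - Z_prime[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dists / (2.0 * sigma**2))

def composite_kernel(K_struct, K_seq, pi):
    """Weighted sum of structure and sequence Gram matrices, as in Equation (4)."""
    if not 0.0 <= pi <= 1.0:
        raise ValueError("pi is assumed to lie in [0, 1]")
    return pi * K_struct + (1.0 - pi) * K_seq

rng = np.random.default_rng(0)
Z = rng.normal(size=(4, 8))                    # stand-in for 1280-dim embeddings
sites = np.array([3.0, 4.0, 10.0, 11.0])
K_struct = np.exp(-np.abs(sites[:, None] - sites[None, :]))  # toy PSD stand-in
K_sequence = k_seq(Z, Z, sigma=3.0)
K = composite_kernel(K_struct, K_sequence, pi=0.5)
```

Since a non-negative weighted sum of positive semidefinite matrices is positive semidefinite, the composite Gram matrix remains a valid covariance for any weight in the assumed range.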
Additional details on both sequence and structure kernels, including a proof of the validity of the structure kernel, implementation details, computational complexity details, and an automatic model selection procedure can be found in Appendices B and C.

Section: Zero-shot mean function
Kermut can be used with a constant mean function, $m(x) = \alpha$, where $\alpha$ is a hyperparameter optimized through the marginal likelihood. However, we posit that additional performance can be gained by using an altered mean function which operates on zero-shot fitness estimates, which are often available at relatively low cost: $m(x) = \alpha f_0(x) + \beta$, where $f_0$ is a zero-shot method evaluated on input sequence $x$. This is similar to the approach employed in [37], where Rosetta scores are used as a biophysical prior. We use ESM-2 [16], which yields the log-likelihood ratio between the variant and wild type residue as in [10]. For details, see Appendix B.
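The effect of a non-zero mean function on the GP prediction can be sketched as below: the GP models the residual between the targets and the mean function, and the mean is added back at test time. The function names and the fixed values of `alpha` and `beta` are illustrative assumptions; in practice these would be hyperparameters fitted together with the kernel parameters.

```python
import numpy as np

def zero_shot_mean(f0_scores, alpha, beta):
    """Linear mean function m(x) = alpha * f0(x) + beta on zero-shot scores."""
    return alpha * np.asarray(f0_scores, dtype=float) + beta

def posterior_mean(K, K_s, y, m_train, m_test, noise_var):
    """GP posterior mean with a non-zero prior mean.

    K: train/train Gram matrix, K_s: train/test cross-covariances.
    """
    residual = np.asarray(y, dtype=float) - m_train
    weights = np.linalg.solve(K + noise_var * np.eye(len(residual)), residual)
    return m_test + K_s.T @ weights

# Toy example: test points that are uncorrelated with all training points.
K = np.eye(3)
K_s = np.zeros((3, 2))
y = np.array([1.0, 2.0, 3.0])
m_train = zero_shot_mean([0.1, 0.2, 0.3], alpha=2.0, beta=0.5)
m_test = zero_shot_mean([1.0, -1.0], alpha=2.0, beta=0.5)
pred = posterior_mean(K, K_s, y, m_train, m_test, noise_var=0.1)
```

When a test variant has no covariance with the training data, the prediction falls back to the zero-shot mean rather than to a constant, which is the intended benefit of this choice.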

Section: Architecture considerations
While Kermut is based on relatively simple principles, its components are non-trivial in that they exploit learned features from pretrained models: ESM-2 provides protein sequence level embeddings and zero-shot scores, while ProteinMPNN provides a featurization of the local structural environments. We stress that these pretrained components can be readily replaced by other pretrained models. Models that generate (1) protein sequence embeddings, (2) structure-conditioned amino acid distributions, and (3) zero-shot scores are plentiful and such models will progress further in future years. In our work, we have not sought to find the optimal combination of these. Our work should instead be seen 

Section: Results
We evaluate Kermut on the 217 substitution DMS assays from the ProteinGym benchmark [7].
The overall benchmark results are an aggregate of three different cross-validation schemes: In the "random" scheme, variants are assigned to one of five folds randomly. In the "modulo" scheme, every fifth position along the protein backbone is assigned to the same fold, and in the "contiguous" scheme, the protein is split into five equal-sized segments along its length, each constituting a fold. For all three schemes, models are trained on four combined partitions and tested on the fifth for a total of five runs per assay, per scheme. The results are processed using the functionality provided in the ProteinGym repository. The average and per-scheme aggregated results can be seen in Table 1.
Our model reaches both higher Spearman correlations and lower mean squared errors than competing methods and thereby achieves state-of-the-art performance in all three schemes, with the largest gains in the challenging modulo and contiguous settings. In Table E.2, the performance per functional category is shown, where we observe a significant performance increase in binding- and stability-related prediction tasks, likely explained by inclusion of the structure kernel. In addition to its high accuracy, Kermut is significantly faster than deep learning methods. We provide wall-clock times for running a 5-fold CV loop for a single split scheme for four select datasets for both Kermut and ProteinNPT in Table C.1. Generating results for all three split schemes for the four datasets thus takes Kermut approximately 10 minutes while ProteinNPT takes upwards of 400 hours.
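The three fold-assignment schemes can be sketched as below. This is one plausible reading of the description above for single-mutant variants with 1-indexed positions; the actual ProteinGym implementation may differ in details such as indexing and how segment boundaries are handled, and the function names are hypothetical.

```python
import random

def random_folds(n_variants, n_folds=5, seed=0):
    """Random scheme: each variant is assigned to a fold uniformly at random."""
    rng = random.Random(seed)
    return [rng.randrange(n_folds) for _ in range(n_variants)]

def modulo_fold(position, n_folds=5):
    """Modulo scheme: every n_folds-th position shares a fold."""
    return (position - 1) % n_folds

def contiguous_fold(position, seq_len, n_folds=5):
    """Contiguous scheme: the sequence is cut into n_folds consecutive segments."""
    return min((position - 1) * n_folds // seq_len, n_folds - 1)

# Fold assignment for mutated positions in a hypothetical protein of length 100.
positions = [1, 5, 6, 20, 21, 100]
modulo = [modulo_fold(p) for p in positions]
contiguous = [contiguous_fold(p, seq_len=100) for p in positions]
```

In the modulo and contiguous schemes, all variants at a held-out position are unseen during training, so the model must generalize across sites. This is why these two schemes are harder than the random one.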
The non-parametric bootstrap standard error for each model relative to Kermut can be seen in Tables E.1 and E.3. In Appendix M, we provide additional results using alternate zero-shot functions.
The average and per-split performance for individual assays can be seen in Figures O.1 to O.4 while additional details on computation time can be seen in Appendix C. Lastly, visualizations of the distributions of optimized hyperparameters can be seen in Appendix N.

Section: Ablation study
To examine the impact of Kermut's components, we conduct an ablation study, where we ablate its two main kernels from Equation (4) -the structure and sequence kernels -as well as the structure kernel's subcomponents. We similarly investigate the importance of the zero-shot mean function.
The ablation study is carried out on all split schemes on a subset of 174 datasets. The difference between the ablation results and the Kermut results can be seen in Table 2, where larger values indicate greater component importance. For the absolute values, see Appendix F, where we additionally include alternative kernel formulations for both structure and sequence kernels as well as for kernel composition.
As indicated by the largest drop in performance, the single most important component is the structure kernel. While removing the sequence kernel leads to comparable decreases for all three schemes, removing the structure kernel primarily leads to drops in the challenging contiguous and modulo schemes. This shows that the structure kernel is crucial for characterizing unseen sites in the wild type protein. While removing the site comparison and mutation probability kernels leads to small and medium drops in performance, we observe that removing both leads to an even larger performance drop, indicating a synergy between the two. In Table E.2, the ablation results per functional category are shown, where we observe that the inclusion of the structure kernel is crucial for the increased performance in structure-related prediction tasks such as binding and stability.
[Caption fragment, six mutation domains: The first three elements correspond to the three split schemes from ProteinGym. The fourth and fifth correspond to training on both single and double mutants, and testing on each, respectively. For the last column, we train on single and test on double mutants, corresponding to an extrapolation setting.]

Section: Uncertainty quantification per mutation domain
By inspecting the posterior predictive variance, we can analyze model uncertainty. To this end, we define several mutation domains of interest [49]. We designate the three split schemes from the ProteinGym benchmark as three such domains. These are examples of interpolation, where we both train and test on single mutants (1M→1M). While the main benchmark only considers single mutations, some assays include additional variants with multiple mutations. We consider a number of these and define two additional interpolation domains where we train on both single and double mutations (1M/2M) and test on singles and doubles, respectively. As a challenging sixth domain, we train on single mutations only and test on doubles (1M→2M), constituting an extrapolation domain.
For details on the multi-mutant splits, see Appendix G.
Figure 2 shows the distributions of mean predictive variances in the six domains. In the three single mutant domains, we observe that the uncertainties increase from scheme to scheme, reflecting the difficulties of the tasks and analogously the expected performance scores (Table 1). When training on both single and double mutants (1M/2M), we observe a lower uncertainty on double mutants than single mutants. For many of the multi-mutant datasets, the mutants are not uniformly sampled but often include a fixed single mutation. A possible explanation is thus that it might be more challenging to decouple the signal from a double mutation into its constituent single mutation signals. In the extrapolation setting, we observe large predictive uncertainties, as expected. One explanation of the discrepancy between the variance distributions in the multi-mutant domains might lie in the difference in target distributions between training and test sets. For reference, we include the results for the multi-mutant domains in Table G.1 in the appendix.

Section: Uncertainty calibration analysis
To clarify the relationship between model uncertainty and expected performance, we proceed with a calibration analysis. First, we perform a confidence interval-based calibration analysis [41], resulting in calibration curves which in the classification setting are known as reliability diagrams [68]. The results for each dataset are obtained via five-fold cross validation, corresponding to five separately trained models for each split scheme. We select four diverse datasets as examples (Table 3), reflecting both high and low predictive performance. The mean calibration curves can be seen in Figure 3a. For method details and results across all datasets, see Appendix J. The mean expected calibration error (ECE) is shown in the bottom of each plot, where a value of zero indicates perfect calibration. Overall, the uncertainties appear to be well-calibrated both qualitatively from the curves and quantitatively from the ECEs. Even the smallest dataset (fourth row, N = 165) achieves decent calibration, albeit with larger variances between folds.
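A confidence interval-based calibration curve can be sketched as follows, assuming Gaussian predictive distributions: for each confidence level, count the fraction of true values inside the corresponding central interval and compare it with the level itself. The grid of levels and the unweighted averaging in the ECE are assumptions; the paper's exact procedure is described in its Appendix J.

```python
from statistics import NormalDist

def calibration_curve(y_true, mean, std, n_levels=10):
    """Expected vs. observed coverage of central Gaussian confidence intervals."""
    levels = [i / n_levels for i in range(1, n_levels)]   # 0.1, ..., 0.9
    observed = []
    for level in levels:
        z = NormalDist().inv_cdf(0.5 + level / 2.0)       # interval half-width in std units
        inside = sum(abs(y - m) <= z * s for y, m, s in zip(y_true, mean, std))
        observed.append(inside / len(y_true))
    return levels, observed

def expected_calibration_error(levels, observed):
    """Mean absolute gap between observed and expected coverage (0 is perfect)."""
    return sum(abs(o - e) for o, e in zip(observed, levels)) / len(levels)

# Toy data: residuals placed at evenly spaced standard normal quantiles.
n = 1000
residuals = [NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)]
y_true = residuals
mean = [0.0] * n

ece_calibrated = expected_calibration_error(*calibration_curve(y_true, mean, [1.0] * n))
ece_overconfident = expected_calibration_error(*calibration_curve(y_true, mean, [0.1] * n))
```

With the correct predictive standard deviation the curve follows the diagonal, whereas shrinking the standard deviation tenfold produces intervals that cover far fewer points than claimed, which is the overconfident case.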
While the confidence interval-based calibration curves show that we can trust the uncertainty estimates overall, they do not indicate whether we can trust individual predictions. We therefore supplement the above analysis with an error-based calibration analysis [42], where a well-calibrated model will have low uncertainty when the error is low. The calibration curves can be seen in Figure 3b. We compute the per CV-fold expected normalized calibration error (ENCE) and the coefficient of variation ($c_v$), which quantifies the variance of the predicted uncertainties. Ideally, the ENCE should be zero while the coefficient of variation should be relatively large (indicating spread-out uncertainties).
While the confidence interval-based analysis suggested that the uncertainty estimates are well-calibrated with some under-confidence, the same is not as visibly clear for the error-based calibration plots, suggesting that the expected correlation between model uncertainty and prediction error is not a given. We do however see an overall trend of increasing error with increasing uncertainty in three of the four datasets, where the curves lie close to the diagonal (as indicated by the dashed line).
The second row shows poorer calibration, particularly in the modulo and contiguous schemes. The curves however remain difficult to interpret, in part due to the error bars on both axes. To alleviate this, we compute similar metrics for all 217 datasets across the three splits, which indicate that, though varying, most calibration curves are well-behaved (see Appendix J.3).
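The two error-based metrics can be sketched as below, following the usual definitions associated with [42]: predictions are sorted by predicted standard deviation and binned, and within each bin the root mean variance (RMV) is compared with the root mean squared error (RMSE). The number of bins and the equal-size binning are assumptions made for this illustration.

```python
import math
from statistics import mean, stdev

def ence(y_true, y_pred, std, n_bins=10):
    """Expected normalized calibration error: mean of |RMV - RMSE| / RMV over bins."""
    order = sorted(range(len(std)), key=lambda i: std[i])
    terms = []
    for b in range(n_bins):
        idx = order[b * len(order) // n_bins:(b + 1) * len(order) // n_bins]
        if not idx:
            continue  # skip empty bins when there are fewer points than bins
        rmv = math.sqrt(mean(std[i] ** 2 for i in idx))
        rmse = math.sqrt(mean((y_true[i] - y_pred[i]) ** 2 for i in idx))
        terms.append(abs(rmv - rmse) / rmv)
    return mean(terms)

def coefficient_of_variation(std):
    """c_v: sample standard deviation of the predicted uncertainties over their mean."""
    return stdev(std) / mean(std)

# Toy example where each absolute error equals the predicted standard deviation.
y_true = [1.0, -1.0, 2.0, -2.0]
y_pred = [0.0, 0.0, 0.0, 0.0]
std = [1.0, 1.0, 2.0, 2.0]

ence_matched = ence(y_true, y_pred, std, n_bins=2)
ence_halved = ence(y_true, y_pred, [s / 2 for s in std], n_bins=2)
c_v = coefficient_of_variation(std)
```

A model that predicts the same uncertainty everywhere can score well on interval coverage while having a coefficient of variation of zero, which is why the two metrics are reported together.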
We supplement the above calibration curves with Figures K.1 to K.4 in the Appendix, which show the true values plotted against the predictions. These highlight the importance of well-calibrated uncertainties and underline their role in interpreting model predictions and their trustworthiness.

Section: Comparison with ProteinNPT
Monte Carlo (MC) dropout [48] is a popular uncertainty quantification technique for deep learning models. Calibration curves for ProteinNPT with MC dropout for the same four datasets across the three split schemes can be seen in Appendix L.2, while figures showing the true values plotted against the predictions with uncertainties are shown in Appendix L.3. These indicate that employing MC dropout on a deep learning model like ProteinNPT seems to provide lower levels of calibration, providing overconfident uncertainties across assays and splits. Due to the generally low uncertainties and the resulting difference in scales, the calibration curves are often far from the diagonal. The trends in the calibration curves however show that the model errors often correlate with the uncertainties, suggesting that the model can be recalibrated to achieve decent calibration.
Other techniques for uncertainty quantification in deep learning models certainly exist, and we by no means rule out that other techniques can outperform our method (see Discussion below). We note, however, that many uncertainty quantification methods will be associated with considerable computational overhead compared to the built-in capabilities of a Gaussian process.

Section: Discussion
We have shown that a carefully constructed Gaussian process is able to reach state-of-the-art performance for supervised protein variant effect prediction while providing reasonably well-calibrated uncertainty estimates. For a majority of datasets, this is achieved orders of magnitude faster than competing methods.
While the predictive performance on the substitution benchmark is an improvement over previous methods, our proposed model has its limitations. Due to the site-comparison mechanism, our model is unable to handle insertions and deletions as it only operates on a fixed structure. Additionally, as the number of mutations increases, the assumption of a fixed structure might worsen, depending on the introduced mutations, which can affect reliability as the local environments might change. An additional limitation is the GP's $O(N^3)$ scaling with dataset size. While not a major obstacle in the single mutant setting, dataset sizes can quickly grow when handling multiple mutants. The last decades have however produced a substantial literature on algorithms for scaling GPs to larger datasets [72][73][74], which could alleviate the issue, and we therefore believe this to be a technical rather than fundamental limitation. An additional limitation might present itself in the multi-mutant setting, where the lack of explicit modeling of epistasis can potentially hinder extrapolation to higher-order mutants, prompting further investigation.
Well-calibrated uncertainties are crucial for protein engineering, both when relying on a Bayesian optimization routine to guide experimental design using uncertainty-dependent acquisition functions and when weighing the risk versus reward of experimentally synthesizing suggested variants. We therefore encourage the community to place a greater emphasis on uncertainty quantification and calibration for protein prediction models, as this will have measurable impacts in real-life applications like protein engineering, perhaps more so than increased prediction accuracy. We hope that Kermut can serve as a fruitful step in this direction.

Section: Appendix A License and code availability
The codebase is publicly available at https://github.com/petergroth/kermut under the open source MIT License.

Section: B GP details B.1 Structure kernel
The structure kernel is comprised of three components, each increasing model flexibility. The site-comparison kernel, $k_H$, compares site-specific, structure-conditioned amino acid distributions. Given two such discrete probability distributions, $p := f_{IF}(x)$ and $q := f_{IF}(x')$, their distance is quantified via the Hellinger distance [75]
$$d_H(p, q) = \frac{1}{\sqrt{2}} \sqrt{\sum_{i=1}^{20} \left(\sqrt{p_i} - \sqrt{q_i}\right)^2},$$
which is used in the Hellinger kernel [66]
$$k_H(x, x') = \exp\left(-\gamma_1 d_H(p, q)\right) = \exp\left(-\gamma_1 \frac{1}{\sqrt{2}} \sqrt{\sum_{i=1}^{20} \left(\sqrt{p_i} - \sqrt{q_i}\right)^2}\right).$$
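As a minimal sketch of the two formulas above, the following standard-library Python computes the Hellinger distance and the resulting kernel value for two site distributions. The function names, the default `gamma`, and the toy distributions are illustrative assumptions and are not taken from the Kermut codebase.

```python
import math


def hellinger_distance(p, q):
    """Hellinger distance between two discrete distributions of equal length."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    squared = sum((math.sqrt(pi) - math.sqrt(qi)) ** 2 for pi, qi in zip(p, q))
    return math.sqrt(squared) / math.sqrt(2.0)


def hellinger_kernel(p, q, gamma=1.0):
    """k_H = exp(-gamma * d_H(p, q)); `gamma` plays the role of gamma_1."""
    return math.exp(-gamma * hellinger_distance(p, q))


# Toy 20-dimensional site distributions (illustrative values only).
uniform = [1.0 / 20] * 20
peaked = [0.81] + [0.01] * 19

similarity = hellinger_kernel(uniform, peaked, gamma=1.0)
```

Identical distributions give a kernel value of one, and the value decays towards zero as the distributions move apart, with `gamma` controlling the rate of decay.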
For the mutation probability kernel, $k_p$, we use the log-probabilities rather than the amino acid identities to reflect that amino acids with similar probability on sites with similar distributions should have similar biochemical effects on the protein. We do not include the log-probabilities of the wild type amino acids, as the inverse folding model by definition is trained to assign high probabilities to the wild type sequence. The exact probability of a wild type amino acid depends on how many other amino acids are likely to be at the given site. For example, we would expect a probability close to one for a functionally critical amino acid at a particular site. Conversely, for a less critical surface-level residue requiring, e.g., a polar uncharged amino acid, we would expect similar probabilities for the four amino acids of this type. Thus, variations of log-probabilities of the wild type amino acids should be reflected in the distribution on the sites captured by the Hellinger kernel.
For the per-residue amino acid distributions in $k_H$ and $k_p$, we use ProteinMPNN [52]. ProteinMPNN relies on a random decoding order and thus benefits from multiple samples. We decode the wild type amino acid sequence a total of ten times while conditioning on the full structure and the sequence that has been decoded thus far. We then compute a per-residue average distribution. We use the v_48_020 weights.
For the distance kernel, $k_d$, we calculate the Euclidean distance between α-carbon atoms, measured in ångström. All wild type structures used in $k_d$ and by ProteinMPNN are predicted via AlphaFold2 [64] and are provided by ProteinGym [7].
The sum over all pairs of mutations in Equation (2) is motivated by [63], who showed that a linear model was sufficient to effectively model the mutation effects. Note that
$$\mathrm{Cov}(Y_1 + Y_2, Y_3 + Y_4) = \mathrm{Cov}(Y_1, Y_3) + \mathrm{Cov}(Y_1, Y_4) + \mathrm{Cov}(Y_2, Y_3) + \mathrm{Cov}(Y_2, Y_4).$$
Hence, if the $Y_i$ are the effects of single mutations, we can find the covariance between two double mutant variants by calculating the pairwise sum, and similarly for other numbers of mutations.

Section: B.2 Sequence kernel
For the sequence kernel, we use embeddings extracted from the ESM-2 protein language model [16].
We use the esm2_t33_650M_UR50D model with 650M parameters. The embeddings are mean-pooled across the sequence dimension such that each variant is represented by a 1280-dimensional vector.
While it has been shown that other aggregation methods can lead to large increases in performance [76], we consider alternative methods, such as training a bottleneck model, out of the scope of this paper.
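The pooling step, together with a squared exponential kernel on the pooled vectors, can be sketched as follows. This is an illustration under stated assumptions: embeddings are plain nested lists rather than ESM-2 tensors, the tiny 2-dimensional toy vectors stand in for 1280-dimensional ones, and `lengthscale` is a hypothetical hyperparameter name using the standard form of the kernel.

```python
import math


def mean_pool(token_embeddings):
    """Average per-residue embeddings (L x D nested lists) into one D-vector."""
    n_tokens = len(token_embeddings)
    dim = len(token_embeddings[0])
    return [sum(tok[d] for tok in token_embeddings) / n_tokens for d in range(dim)]


def squared_exponential(u, v, lengthscale=1.0):
    """Standard squared exponential kernel on two pooled embedding vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-sq_dist / (2.0 * lengthscale ** 2))


# Toy "embeddings" for two variants of length 2 (illustrative values only).
variant_a = mean_pool([[1.0, 2.0], [3.0, 4.0]])
variant_b = mean_pool([[1.0, 2.0], [3.0, 6.0]])
k_ab = squared_exponential(variant_a, variant_b)
```

Because pooling averages over all residues, a single substitution only shifts the pooled vector slightly, which is one reason the choice of aggregation method can matter.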

Section: B.3 Parametrization note
An alternative formulation of the kernel, where we omit λ from Equation (1) and π from Equation (4) and instead provide the structure and sequence kernels with separate coefficients, has been shown to provide close to identical results when using a smoothed box prior (constrained between 0 and 1). While this approach is somewhat more elegant, it does not justify the re-computation of all results.

Section: B.4 Zero-shot mean function
For the zero-shot mean function, we download and use the pre-computed zero-shot scores from ProteinGym at https://github.com/OATML-Markslab/ProteinGym. The zero-shot value is calculated as the log-likelihood ratio between the variant and the wild type at the mutated residue as described in [10]:
$$f_0(x) = \sum_{i \in M} \left[\log p(x_i) - \log p(x_i^{WT})\right]$$
The values can be computed straightforwardly using the ESM suite (using the masked-marginals strategy). For multi-mutants, the sum of the ratios is taken.
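To make the summation concrete, here is a small sketch of the log-likelihood ratio score. It assumes the per-position log-probabilities have already been obtained from a language model; the data layout (a list of dictionaries), the function name, and the toy probabilities are hypothetical and are not the ProteinGym or ESM interface.

```python
import math


def zero_shot_score(log_probs, wild_type, mutations):
    """Sum of log-likelihood ratios over the mutated positions.

    log_probs: one dict per position, mapping amino acid -> log-probability.
    wild_type: the wild type sequence as a string.
    mutations: dict mapping 0-based position -> mutant amino acid.
    """
    return sum(
        log_probs[pos][aa] - log_probs[pos][wild_type[pos]]
        for pos, aa in mutations.items()
    )


# Toy two-residue protein with made-up probabilities.
toy_log_probs = [
    {"A": math.log(0.5), "G": math.log(0.25)},
    {"K": math.log(0.8), "R": math.log(0.1)},
]
single = zero_shot_score(toy_log_probs, "AK", {0: "G"})
double = zero_shot_score(toy_log_probs, "AK", {0: "G", 1: "R"})
```

A negative score means the model considers the variant less likely than the wild type at the mutated positions; for multi-mutants the per-site ratios simply add.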

Section: B.5 Kernel proof
We will in the following argue why Kermut's structure kernel is a valid kernel. Recall that a function k : X × X → R is a kernel if and only if the matrix K, where K ij = k(x i , x j ), is symmetric positive semi-definite [65]. From literature we know a number of kernels and certain ways these can be combined to create new kernels. We will argue that Kermut is a kernel by showing how it is composed of known kernels, combined using valid methods.
Note that, if we have a mapping $f : X \to Z$ and a kernel $k_Z$ on $Z$, then $k_X$ defined by $k_X(x, x') := k_Z(f(x), f(x'))$ is a kernel on $X$.
Let $X$ be the space of sequences parameterized with respect to a reference sequence. Let $f_1 : X \to Z$ be a transformation of the sequences defined as in Section 3.3. $k_{seq}$ is the squared exponential kernel on the transformed variants. Hence, $k_{seq}$ is a kernel on $X$.
Let $X_1 \subset X$ denote the subspace of single mutant variants and $f_2 : X_1 \to \mathbb{R}^3$ be a function mapping single mutant variants into the 3D coordinates of the α-carbon of the particular mutation. $k_d$ is the exponential kernel on this transformed space. Thus $k_d$ is a kernel on $X_1$.
Let $f_{IF} : X_1 \to G \subseteq [0, 1]^{20}$ be defined as in Section 3.3, where $G$ is the space of probability distributions over the 20 amino acids. $f_{IF}(x)$ is the probability distribution over the mutated site of $x$ given by an inverse folding model. $k_H$ is the Hellinger kernel on the single mutation variants transformed by $f_{IF}$, hence, a valid kernel on $X_1$. Likewise, $k_p$ is the exponential kernel of a transformation, $f_{IF1} : X_1 \to [0, 1]$ as defined in Section 3.3, of the sequences, hence also a kernel.
Scaling, multiplying, and adding kernels result in new kernels, making $k^1_{struct}$ a valid kernel for single mutations [65]. We need to show that $k_{struct}$, and thereby $k$, is valid for any number of mutations.
Let $f_4 : X \to B$ be a function taking a variant $x$ with $M$ mutations and mapping it to the set $b = \{x^m\}_{m \in M}$ of all the single mutations which constitute $x$. Define the set kernel [77]
$$k_{set}(b, b') := \sum_{x^m \in b,\, x'^m \in b'} \lambda k^1_{struct}(x^m, x'^m).$$
$k_{struct}$ is the set kernel on the transformed input and thus also a kernel. We have thereby shown that Kermut is a kernel for variants with any number of mutations.
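The set kernel construction can be illustrated with a short sketch: a multi-mutant is decomposed into its single mutations, and the single-mutation kernel is summed over all pairs. The representation of a mutation as a `(position, amino_acid)` tuple and the toy single-mutation kernel are assumptions made purely for illustration; Kermut's actual single-mutation kernel combines $k_H$, $k_p$, and $k_d$.

```python
import math


def set_kernel(muts_a, muts_b, k_single, lam=1.0):
    """Sum a single-mutation kernel over all pairs of single mutations."""
    return sum(lam * k_single(a, b) for a in muts_a for b in muts_b)


def toy_single_kernel(a, b):
    """Stand-in for the single-mutation structure kernel (illustrative only):
    an exponential kernel on the sequence positions of the two mutations."""
    return math.exp(-abs(a[0] - b[0]))


# Two double mutants, each given as a list of (position, amino acid) tuples.
variant_1 = [(1, "A"), (2, "C")]
variant_2 = [(1, "G"), (4, "T")]
k_12 = set_kernel(variant_1, variant_2, toy_single_kernel)
```

For variants with $m_1$ and $m_2$ mutations the double loop performs $m_1 \times m_2$ single-mutation kernel evaluations, which mirrors the pairwise covariance sum used to motivate the construction.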

Section: B.6 Automatic Model Selection
Kermut has been developed from the ground up as described in Section 3.3, which led to separate structure and sequence kernels which are added together. Alternatively, an automatic model selection scheme can be employed for data-driven kernel composition as proposed in Chapter 3 in [78]. For demonstrative purposes, we conduct such a model selection procedure. We define four base kernels ($k_H$, $k_p$, $k_d$, $k_{seq}$) as well as sum and product operations. We choose a subset of 17 ProteinGym assays and fit a GP (with a zero-shot mean function) with each of the four kernels, where the best performing kernel across splits is kept. For the second round, each of the remaining three kernels is either added to or multiplied with the existing kernel. This process is continued until either all four kernels are used or until the test performance no longer increases. The results can be seen in Table B.1, where the final kernel is the product of the four base kernels. The results on the 174 ablation datasets using this kernel (denoted as "Kermut (product)") can be seen with the main Kermut GP in Table F
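The greedy search described above can be sketched in a few lines. This is a schematic illustration only: kernel expressions are represented as nested tuples, and `toy_score` is a made-up stand-in for the expensive step of fitting a GP and measuring test performance on the selected assays.

```python
def greedy_kernel_search(base_kernels, score):
    """Greedy forward selection over kernel expressions.

    base_kernels: list of kernel names.
    score: callable mapping an expression to a number (higher is better).
    An expression is a name or a nested tuple (op, left, right), op in {"+", "*"}.
    """
    current, best_score = None, float("-inf")
    remaining = list(base_kernels)
    while remaining:
        candidates = []
        for name in remaining:
            if current is None:
                candidates.append((name, name))  # first round: single kernels
            else:
                candidates.append((("+", current, name), name))
                candidates.append((("*", current, name), name))
        scored = [(score(expr), expr, name) for expr, name in candidates]
        top_score, top_expr, top_name = max(scored, key=lambda t: t[0])
        if top_score <= best_score:
            break  # performance no longer increases
        current, best_score = top_expr, top_score
        remaining.remove(top_name)
    return current, best_score


def toy_score(expr):
    """Made-up score: one point per base kernel, small bonus per product."""
    if isinstance(expr, str):
        return 1.0
    op, left, right = expr
    return toy_score(left) + toy_score(right) + (0.1 if op == "*" else 0.0)


best_expr, best = greedy_kernel_search(["k_H", "k_p", "k_d", "k_seq"], toy_score)
```

With a real scoring function each call to `score` would require fitting GPs across assays and splits, so the number of candidate evaluations per round (at most two per remaining kernel) is the dominant cost.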

Section: C Implementation details C.1 General details
We build our kernel using the GPyTorch framework [79]. We assume a homoscedastic Gaussian noise model, on which we place a HalfCauchy prior [80] with scale 0.1. We fit the hyperparameters by maximizing the exact marginal likelihood with gradient descent using the AdamW optimizer [81] with learning rate 0.1 for 150 steps, which proved to be sufficient for convergence for a number of sampled datasets.

Section: C.2 System details
All experiments are performed on a Linux-based cluster running Ubuntu 20.04.4 LTS, with an AMD EPYC 7642 48-Core Processor with 192 threads and 1TB RAM. NVIDIA A40s were used for GPU acceleration both for fitting the Gaussian processes and for generating the protein embeddings.

Section: C.3 Compute time
There are multiple factors to consider when evaluating the training time of Kermut. A limitation of using a Gaussian process framework is the cubic scaling with dataset size. For this reason, we ran the ablation study on only 174 of the 217 datasets.
Training Kermut on these datasets for a single split scheme using the aforementioned hardware (single GPU) takes approximately 1 hour and 30 minutes. Getting ablation results for all three schemes thus takes between 4 and 5 hours. This however assumes that (1) the embeddings for the sequence kernel have been precomputed, (2) the probability distributions for all sites in the wild type protein have been precomputed, and (3) the zero-shot scores have been precomputed.
We do not see any of these as major limitations as the same applies to ProteinNPT and similar models.
Scaling the experiments to the full ProteinGym benchmark is however costly. We were able to train/evaluate Kermut on 215/217 datasets using an NVIDIA A40 GPU with 48GB VRAM. The remaining two datasets, POLG_CXB3N_Mattenberger_2021 and POLG_DEN26_Suphatrakul_2023, were too large to fit into GPU memory without resorting to reduced precision. For these, we trained and evaluated the model using CPU only, which takes considerable time.

Section: C.4 Handling long sequences
The structures used for obtaining the site-wise probability distributions are predicted by AlphaFold2 [64] and are provided in the ProteinGym repository [7]. The provided structures for A0A140D2T1_ZIKV and POLG_HCVJF do not contain the full structures however, but only a localized area where the mutations occur due to the long sequence lengths. Since our model only operates on sites with mutations, this is not an issue for either the inverse-folding probability distributions or the inter-residue distances.
For BRCA2_HUMAN, three PDB files are provided due to the long wild type sequence length. We use ProteinMPNN to obtain the distributions at all sites in each PDB file and stitch them together in a preprocessing step. Calculating the inter-residue distances is however non-trivial and would require a careful alignment of the three structures. Instead, we drop the distance term, $k_d$, in the kernel for the BRCA2_HUMAN_Erwood_2022_HEK293T dataset (equivalently setting it to one: $k_d(x, x') = 1$).
The sequence kernel operates on ESM-2 embeddings. The ESM-2 model has a maximum sequence length of 1022 amino acids. Protein sequences that are longer than this limit are truncated.

Section: C.4.1 Example wall-clock time
The wall-clock times for generating the results used for the calibration curves in Figures 3a and 3b and Appendix L.2 for one split scheme can be seen in Table C.1 for Kermut and ProteinNPT. The experiments were carried out using identical hardware. The test system is however a shared compute cluster with sharded GPUs, so variance is expected between runs. Generating the full results for the figures thus takes Kermut approximately 10 minutes while ProteinNPT takes 400 hours. This shows the significant reduction in computational burden that Kermut allows for. Both ProteinNPT and Kermut assume that sequence embeddings are available a priori.

Section: C.5 Computational complexity
Evaluating the kernel for two variants $x$ and $x'$ involves computation of the two components $k_{seq}$ and $k_{struct}$:
For each variant, $k_{seq}$ requires a forward pass through ESM-2, which is based on the transformer architecture and has quadratic scaling with respect to the sequence length. Given the ESM-2 embeddings, the computational complexity is constant in sequence length and number of mutations.
$k_{struct}$ requires a single forward pass through ProteinMPNN for the wild type protein. Given the output of ProteinMPNN, the computational complexity of evaluating the kernel is $O(m_1 m_2)$ for two proteins with $m_1$ and $m_2$ mutations, respectively. The computational complexity of $k^1_{struct}$ is constant in sequence length and number of mutations.

Section: D Data
All data and evaluation software are accessed via the ProteinGym [7] repository at https://github.com/OATML-Markslab/ProteinGym, which is under the MIT License.

Section: E Detailed results
The ProteinGym suite provides an aggregation procedure, whereby the predictive performance across both cross-validation schemes and functional categories can be gauged. We provide these results in Tables E.1 to E.4. We mirror the label normalization from [19] and [7], where, for each fold of cross-validation, train and test labels are normalized given the mean and standard deviation of the training labels. The Spearman correlation coefficient and MSE are then computed in the normalized space across folds, leading to a single Spearman coefficient and MSE per assay per split (i.e., not as an average of metrics per CV-fold). 
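The normalization and pooling described above can be sketched as follows. This is a simplified illustration, not the ProteinGym aggregation code: it assumes the model's predictions are already in the normalized label space of their fold, uses the sample standard deviation (an assumption), and shows only the MSE since a rank correlation would need a separate routine.

```python
import statistics


def normalize_fold(train_labels, test_labels):
    """Standardize both label sets with the training mean and standard deviation."""
    mu = statistics.mean(train_labels)
    sd = statistics.stdev(train_labels)  # sample standard deviation (assumption)
    train_norm = [(y - mu) / sd for y in train_labels]
    test_norm = [(y - mu) / sd for y in test_labels]
    return train_norm, test_norm


def pooled_mse(folds):
    """One MSE over all folds, not an average of per-fold MSEs.

    folds: iterable of (train_labels, test_labels, predictions), where the
    predictions are in the normalized space of that fold.
    """
    squared_errors = []
    for train_labels, test_labels, predictions in folds:
        _, test_norm = normalize_fold(train_labels, test_labels)
        squared_errors.extend((y - p) ** 2 for y, p in zip(test_norm, predictions))
    return sum(squared_errors) / len(squared_errors)
```

Pooling the normalized test points before computing the metric weights every test variant equally, whereas averaging per-fold metrics would weight every fold equally regardless of its size.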

Section: E.1 Results on new splits
The modulo and contiguous splits in ProteinGym were updated in April 2024. For completeness, we here provide results using Kermut on the updated splits for all 217 DMS assays. These can be seen in Table E.5, where we observe a slight decrease in performance for the contiguous and modulo splits, compared to the reference results in Table 1.

Section: G Results for multi-mutants in ProteinGym
69 of the datasets from the ProteinGym benchmark include multi-mutants. In addition to the random, modulo, and contiguous splits, these also have a fold_rand_multiples split. We here show the results for Kermut in this setting. Of the 69 datasets, we select 52 which (due to the cubic scaling of fitting GPs) include fewer than 7500 variant sequences. We additionally ignore the GCN4_YEAST_Staller_2018 dataset, which has a very large number of mutations. This leads to a total of 51 datasets. All results where the training domain is "1M/2M→" are from models trained using the above split. The "1M→" domain results correspond to training the model once on single mutants and evaluating it on double mutants. In addition to Kermut, we include as a baseline results from a GP using the sequence kernel operating on mean-pooled ESM-2 embeddings. This is equivalent to setting the structure kernel to zero, $k_{struct} = 0$, as in Table 2. For results on the multi-mutant GB1 landscape from FLIP [6], see Appendix H.

Section: H Results on GB1 landscape from the FLIP benchmark
To further investigate Kermut's performance in a multi-mutant setting, we apply it to the GB1 fitness landscape from FLIP [6]. The results can be seen in Table H.1. Kermut's base configuration severely underperforms in the 1-vs-rest split, while reaching similar correlation coefficients in the 2-vs-rest and 3-vs-rest splits. A reason for the initial low score might be that the accuracy of zero-shot methods tends to decrease with higher mutation count. Despite this, we still see low performance compared to other models when removing the zero-shot mean function.
The GB1 landscape is comprised of 149,361 mutations at exactly four highly epistatic positions in the GB1 binding domain of Protein G. This is a challenging task for Kermut, whose structural kernel directly compares sites. With only four sites to compare, Kermut fails to accurately model the fitness landscape. Additionally, as described in the main text, the only epistatic modeling in Kermut is via the mean-pooled protein language model embeddings in the sequence kernel, which proves insufficient to capture the interplay of these highly epistatic sites. Further experimentation is required to thoroughly gauge Kermut's performance across diverse multi-mutant assays where the number of mutated residues is variable and epistasis plays a central role.

Section: J Uncertainty calibration J.1 Confidence interval-based calibration
Given a collection of mean predictions and uncertainties, we wish to gauge how well-calibrated the uncertainties are. The posterior predictive mean and variance for each data point are interpreted as a Gaussian distribution, and symmetric intervals of varying confidence are placed on each prediction [41]. In a well-calibrated model, approximately x% of predictions should lie within an x% confidence interval, e.g., 50% of observations should fall in the 50% confidence interval. The confidence intervals are discretized into $K$ bins and the fraction of predictions falling within each bin is calculated. The calibration curve then plots the confidence intervals vs. the fractions, whereby a diagonal line corresponds to perfect calibration. Given the fractions and confidence intervals, the expected calibration error (ECE) is calculated as
$$\mathrm{ECE} = \frac{1}{K} \sum_{i=1}^{K} |\mathrm{acc}(i) - i|,$$
where K is the number of bins, i indicates the equally spaced confidence intervals, and acc(i) is the fraction of predictions falling within the ith confidence interval.
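A minimal sketch of this procedure is given below, using only the Python standard library. The choice of confidence levels $i/(K+1)$ for $i = 1, \dots, K$ is an assumption made here so that every level lies strictly between 0 and 1; the exact spacing used in the paper is not specified in this section, and the function name is hypothetical.

```python
from statistics import NormalDist


def confidence_interval_calibration(y_true, mean, std, n_bins=10):
    """Return (levels, fractions, ece) for Gaussian predictive distributions."""
    standard_normal = NormalDist()
    levels = [i / (n_bins + 1) for i in range(1, n_bins + 1)]
    fractions = []
    for level in levels:
        # Half-width (in standard deviations) of the symmetric interval.
        z = standard_normal.inv_cdf(0.5 + level / 2.0)
        inside = sum(abs(y - m) <= z * s for y, m, s in zip(y_true, mean, std))
        fractions.append(inside / len(y_true))
    ece = sum(abs(f - lv) for f, lv in zip(fractions, levels)) / n_bins
    return levels, fractions, ece
```

Plotting `fractions` against `levels` gives the calibration curve; points above the diagonal indicate under-confident (too wide) intervals and points below indicate over-confident ones.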

Section: J.2 Error-based calibration
An alternative method of gauging calibratedness is error-based calibration, where the prediction error is tied directly to the predictive uncertainties [41,42]. The predictions are sorted according to their predictive uncertainty and placed into $K$ bins. For each bin, the root mean square error (RMSE) and root mean variance (RMV) are computed. In error-based calibration, a well-calibrated model has equal RMSE and RMV, i.e., a diagonal line. The x and y values in the resulting calibration plot are however not normalized from 0 to 1 as in confidence interval-based calibration. The expected normalized calibration error (ENCE) can be computed as
$$\mathrm{ENCE} = \frac{1}{K} \sum_{i=1}^{K} \frac{|\mathrm{RMV}(i) - \mathrm{RMSE}(i)|}{\mathrm{RMV}(i)}.$$
Additionally, we compute the coefficient of variation ($c_v$) as
$$c_v = \frac{1}{\mu_\sigma} \sqrt{\frac{\sum_{n=1}^{N} (\sigma_n - \mu_\sigma)^2}{N - 1}},$$
where $\mu_\sigma = \frac{1}{N} \sum_{n=1}^{N} \sigma_n$, and where $n$ indexes the $N$ data points [42].
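The error-based metrics can be sketched as below. The binning into contiguous, near-equal-sized groups of the uncertainty-sorted predictions is an assumption about a detail not fixed in this section, and the function name is hypothetical.

```python
import math
import statistics


def error_based_calibration(y_true, mean, std, n_bins=5):
    """Return (rmv, rmse, ence, c_v) for per-point predictive std deviations."""
    n = len(std)
    if n < n_bins:
        raise ValueError("need at least one data point per bin")
    order = sorted(range(n), key=lambda i: std[i])  # sort by uncertainty
    rmv, rmse = [], []
    for b in range(n_bins):
        idx = order[n * b // n_bins : n * (b + 1) // n_bins]
        rmse.append(math.sqrt(sum((y_true[i] - mean[i]) ** 2 for i in idx) / len(idx)))
        rmv.append(math.sqrt(sum(std[i] ** 2 for i in idx) / len(idx)))
    ence = sum(abs(v - e) / v for v, e in zip(rmv, rmse)) / n_bins
    # statistics.stdev uses the N - 1 denominator, matching the c_v definition.
    c_v = statistics.stdev(std) / statistics.mean(std)
    return rmv, rmse, ence, c_v
```

An ENCE of zero means the per-bin RMSE matches the per-bin RMV exactly, while a $c_v$ near zero would indicate that the model assigns almost the same uncertainty to every prediction.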

Section: J.3 Uncertainty calibration across all datasets
To quantitatively describe the calibratedness of Kermut across all datasets, we compute the above calibration metrics for all split schemes and folds. We do this for Kermut and a baseline GP using the sequence kernel on ESM-2 embeddings (equivalent to the sequence kernel from Equation (3) with a constant mean). These values can be seen in Figure J.1 for each of the three main split schemes, the 1M/2M→1M/2M split ("Multiples"), and the 1M→2M split ("Extrapolation"). For the three main splits, we see that the calibratedness measured by the ENCE correlates with performance, where the lowest values are seen in the random setting. Generally, we see that the inclusion of the structure kernel improves not only performance (see Table 2) but also calibration, as the ECE and ENCE values are consistently better for Kermut, with the exception of ENCE in the extrapolation domain. We however see that the sequence kernel (squared exponential) consistently provides predictive variances that themselves vary more, which is generally preferable.
Overall, we can conclude that Kermut appears to be well-calibrated both qualitatively and quantitatively. While the ECE values are generally small and similar, the ENCE values suggest a more nuanced calibration landscape, where we can expect low errors when our model predicts low uncertainties, particularly in the random scheme.
For each dataset, split, and fold, we perform a linear regression to the error-based calibration curves and summarize the slope and intercept. As perfect calibration corresponds to a diagonal line, we want the distribution over the slopes to be centred on one and the distribution over intercepts to be centred on zero. Boxplots for this analysis can be seen in Appendix J.  
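The slope and intercept summary can be obtained with an ordinary least-squares fit, sketched here with the standard library. Placing RMV on the x-axis and RMSE on the y-axis is an assumption about the plot orientation, and the function name is hypothetical.

```python
def fit_calibration_line(rmv, rmse):
    """Least-squares line rmse = slope * rmv + intercept through the curve."""
    n = len(rmv)
    mean_x = sum(rmv) / n
    mean_y = sum(rmse) / n
    sxx = sum((x - mean_x) ** 2 for x in rmv)
    if sxx == 0:
        raise ValueError("all RMV values are identical; slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(rmv, rmse))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# A perfectly calibrated toy curve lies on the diagonal.
slope, intercept = fit_calibration_line([0.1, 0.2, 0.3], [0.1, 0.2, 0.3])
```

A slope near one with an intercept near zero corresponds to the diagonal; a slope near one with a positive intercept would suggest uncertainties that track the error but are uniformly too small.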

Section: L Calibration curves for ProteinNPT L.1 ProteinNPT details
We run ProteinNPT using the software provided with the paper, with the default settings. We generate the MSA Transformer embeddings manually using the provided software from ProteinNPT. During evaluation, we predict using Monte Carlo dropout (with 25 samples, as described in the ProteinNPT appendix). An uncertainty estimate per test sequence is obtained by taking the standard deviation over the 25 samples as prescribed.
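Schematically, the MC dropout estimate amounts to repeated stochastic forward passes followed by a mean and a standard deviation. The sketch below is not the ProteinNPT interface: `stochastic_forward` is a hypothetical callable standing in for a forward pass with dropout kept active, the noisy toy model is made up, and the use of the sample standard deviation is an assumption.

```python
import random
import statistics


def mc_dropout_predict(stochastic_forward, x, n_samples=25):
    """Mean prediction and uncertainty from repeated stochastic forward passes."""
    samples = [stochastic_forward(x) for _ in range(n_samples)]
    return statistics.mean(samples), statistics.stdev(samples)


# Toy stand-in for a network with active dropout (illustrative only).
rng = random.Random(0)


def noisy_model(x):
    return 2.0 * x + rng.gauss(0.0, 0.1)


prediction, uncertainty = mc_dropout_predict(noisy_model, 1.5)
```

The spread of the samples reflects only the variability induced by dropout, which is one reason such estimates can come out on a different scale than the actual prediction errors.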

Section: L.2 Calibration curves


Section: M Alternative zero-shot methods
Kermut uses a linear transformation of a variant's zero-shot score as its mean function. In the main results ESM-2 was used. We here provide additional results where different zero-shot methods are used. The experiments are carried out as the ablation results in Section 4.1, i.e., on 174/217 datasets. All zero-shot scores are pre-computed and are available via the ProteinGym suite.
Using a zero-shot mean function instead of a constant mean evidently leads to increased performance. The magnitude of the improvement depends on the chosen zero-shot method, where the order roughly corresponds to that of the zero-shot scores in ProteinGym. We do however see that opting for EVE yields the largest performance increase.

Section: Safeguards
Question: Does the paper describe safeguards that have been put in place for responsible release of data or models that have a high risk for misuse (e.g., pretrained language models, image generators, or scraped datasets)?
Answer: [No]
Justification: We have not described safeguards because the risk of misuse is low; potential misuse is nevertheless discussed in Appendix P.

Section: Licenses for existing assets
Question: Are the creators or original owners of assets (e.g., code, data, models), used in the paper, properly credited and are the license and terms of use explicitly mentioned and properly respected?
Answer: [Yes]
Justification: Credit is given to the benchmark data and software in Appendix D.

Section: Acknowledgments and Disclosure of Funding
This work was funded in part by Innovation Fund Denmark (1044-00158A), the Novo Nordisk Synergy grant (NNF20OC0063709), VILLUM FONDEN (40578), the Pioneer Centre for AI (DNRF grant number P1), and the Novo Nordisk Foundation through the MLSS Center (Basic Machine Learning Research in Life Science, NNF20OC0062606).

Section: O Detailed results per DMS
We show the performance of Kermut and ProteinNPT per DMS assay in Figures O.1 to O.4. The figures show the average performance and the performance per split, respectively.

Section: O.1 Detailed results per DMS (average)
[Figure: per-assay performance of ProteinNPT and Kermut, one row per DMS assay, horizontal axis from 0.00 to 1.00. Panel (a): top half.]

Section: O.2 Detailed results per DMS (random)
[Figure: per-assay performance of ProteinNPT and Kermut, one row per DMS assay, horizontal axis from 0.00 to 1.00. Panel (a): top half.]
NRAM_I33A0_Jiang_2016 UBC9_HUMAN_Weile_2017 HMDH_HUMAN_Jiang_2019 SRC_HUMAN_Nguyen_2022 SC6A4_HUMAN_Young_2021 RL40A_YEAST_Mavor_2016 GAL4_YEAST_Kitzman_2015 SRC_HUMAN_Chakraborty_2023_binding-DAS_25uM P53_HUMAN_Giacomelli_2018_Null_Nutlin CASP3_HUMAN_Roychowdhury_2020 I6TAH8_I68A0_Doud_2015 ILF3_HUMAN_Tsuboyama_2023_2L33 A4_HUMAN_Seuma_2022 DYR_ECOLI_Thompson_2019 RL40A_YEAST_Roscoe_2014 A4D664_9INFA_Soh_2019 ESTA_BACSU_Nutschel_2020 DLG4_RAT_McLaughlin_2012 Q59976_STRSQ_Romero_2015 MTH3_HAEAE_RockahShmuel_2015 GLPA_HUMAN_Elazar_2016 P53_HUMAN_Giacomelli_2018_Null_Etoposide RASK_HUMAN_Weng_2022_abundance NPC1_HUMAN_Erwood_2022_RPE1 IF1_ECOLI_Kelsic_2016 BLAT_ECOLX_Jacquier_2013 TADBP_HUMAN_Bolognesi_2019 MLAC_ECOLI_MacRae_2023 SPG1_STRSG_Wu_2016 Q53Z42_HUMAN_McShan_2019_expression LGK_LIPST_Klesmith_2015 KCNE1_HUMAN_Muhammad_2023_function TRPC_THEMA_Chan_2017 Q53Z42_HUMAN_McShan_2019_binding-TAPBPR PHOT_CHLRE_Chen_2023 O.3 Detailed results per DMS (modulo) 0.00 0.25 0.50 0.75 1.00 RNC_ECOLI_Weeks_2023 PHOT_CHLRE_Chen_2023 CASP7_HUMAN_Roychowdhury_2020 TRPC_SACS2_Chan_2017 LGK_LIPST_Klesmith_2015 YAP1_HUMAN_Araya_2012 PKN1_HUMAN_Tsuboyama_2023_1URF NUD15_HUMAN_Suiter_2020 FKBP3_HUMAN_Tsuboyama_2023_2KFV KCNE1_HUMAN_Muhammad_2023_function P53_HUMAN_Giacomelli_2018_Null_Nutlin VKOR1_HUMAN_Chiasson_2020_abundance D7PM05_CLYGR_Somermeyer_2022 C6KNH7_9INFA_Lee_2018 RASK_HUMAN_Weng_2022_binding-DARPin_K55 P53_HUMAN_Giacomelli_2018_Null_Etoposide SPIKE_SARS2_Starr_2020_binding S22A1_HUMAN_Yee_2023_activity RFAH_ECOLI_Tsuboyama_2023_2LCL RAD_ANTMA_Tsuboyama_2023_2CJJ A0A2Z5U3Z0_9INFA_Doud_2016 NPC1_HUMAN_Erwood_2022_RPE1 CASP3_HUMAN_Roychowdhury_2020 AMIE_PSEAE_Wrenbeck_2017 RASH_HUMAN_Bandaru_2017 Q59976_STRSQ_Romero_2015 UBR5_HUMAN_Tsuboyama_2023_1I2T CAR11_HUMAN_Meitlis_2020_lof SOX30_HUMAN_Tsuboyama_2023_7JJK POLG_PESV_Tsuboyama_2023_2MXD P84126_THETH_Chan_2017 RL40A_YEAST_Roscoe_2013 NRAM_I33A0_Jiang_2016 SAV1_MOUSE_Tsuboyama_2023_2YSB PSAE_SYNP2_Tsuboyama_2023_1PSE 
OPSD_HUMAN_Wan_2019 ODP2_GEOSE_Tsuboyama_2023_1W4G MTH3_HAEAE_RockahShmuel_2015 S22A1_HUMAN_Yee_2023_abundance SBI_STAAM_Tsuboyama_2023_2JVG KKA2_KLEPN_Melnikov_2014 P53_HUMAN_Kotler_2018 BLAT_ECOLX_Jacquier_2013 PR40A_HUMAN_Tsuboyama_2023_1UZC SPTN1_CHICK_Tsuboyama_2023_1TUD PABP_YEAST_Melamed_2013 PRKN_HUMAN_Clausen_2023 DN7A_SACS2_Tsuboyama_2023_1JIC DOCK1_MOUSE_Tsuboyama_2023_2M0Y BCHB_CHLTE_Tsuboyama_2023_2KRU TADBP_HUMAN_Bolognesi_2019 CP2C9_HUMAN_Amorosi_2021_abundance YAIA_ECOLI_Tsuboyama_2023_2KVT PITX2_HUMAN_Tsuboyama_2023_2L7M THO1_YEAST_Tsuboyama_2023_2WQG CUE1_YEAST_Tsuboyama_2023_2MYX SRBS1_HUMAN_Tsuboyama_2023_2O2W CBPA2_HUMAN_Tsuboyama_2023_1O6X NPC1_HUMAN_Erwood_2022_HEK293T SPG1_STRSG_Wu_2016 SPIKE_SARS2_Starr_2020_expression SCIN_STAAR_Tsuboyama_2023_2QFF RCRO_LAMBD_Tsuboyama_2023_1ORC YNZC_BACSU_Tsuboyama_2023_2JVD ISDH_STAAW_Tsuboyama_2023_2LHR MBD11_ARATH_Tsuboyama_2023_6ACV ARGR_ECOLI_Tsuboyama_2023_1AOY SPA_STAAU_Tsuboyama_2023_1LP1 RS15_GEOSE_Tsuboyama_2023_1A32 CATR_CHLRE_Tsuboyama_2023_2AMI OTC_HUMAN_Lo_2023 OTU7A_HUMAN_Tsuboyama_2023_2L2D SPG2_STRSG_Tsuboyama_2023_5UBS CBX4_HUMAN_Tsuboyama_2023_2K28 MYO3_YEAST_Tsuboyama_2023_2BTT MAFG_MOUSE_Tsuboyama_2023_1K1V VILI_CHICK_Tsuboyama_2023_1YU5 CP2C9_HUMAN_Amorosi_2021_activity EPHB2_HUMAN_Tsuboyama_2023_1F0M NKX31_HUMAN_Tsuboyama_2023_2L9R BBC1_YEAST_Tsuboyama_2023_1TG0 DNJA1_HUMAN_Tsuboyama_2023_2LO1 FECA_ECOLI_Tsuboyama_2023_2D1U GRB2_HUMAN_Faure_2021 HECD1_HUMAN_Tsuboyama_2023_3DKM BLAT_ECOLX_Stiffler_2015 HCP_LAMBD_Tsuboyama_2023_2L6Q PIN1_HUMAN_Tsuboyama_2023_1I6C RD23A_HUMAN_Tsuboyama_2023_1IFY A4GRB6_PSEAI_Chen_2020 CSN4_MOUSE_Tsuboyama_2023_1UFM RCD1_ARATH_Tsuboyama_2023_5OAO BLAT_ECOLX_Firnberg_2014 OBSCN_HUMAN_Tsuboyama_2023_1V1C VRPI_BPT7_Tsuboyama_2023_2WNM VG08_BPP22_Tsuboyama_2023_2GP8 TNKS2_HUMAN_Tsuboyama_2023_5JRT TCRG1_MOUSE_Tsuboyama_2023_1E0L UBE4B_HUMAN_Tsuboyama_2023_3L1X SR43C_ARATH_Tsuboyama_2023_2N88 SDA_BACSU_Tsuboyama_2023_1PV0 P53_HUMAN_Giacomelli_2018_WT_Nutlin 
AMFR_HUMAN_Tsuboyama_2023_4G3O SQSTM_MOUSE_Tsuboyama_2023_2RRU RBP1_HUMAN_Tsuboyama_2023_2KWH NUSA_ECOLI_Tsuboyama_2023_1WCL RPC1_BP434_Tsuboyama_2023_1R69 PPARG_HUMAN_Majithia_2016 NUSG_MYCTU_Tsuboyama_2023_2MI6 ProteinNPT Kermut (a) Top half 0.00 0.25 0.50 0.75 1.00 HIS7_YEAST_Pokusaeva_2019 A0A1I9GEU1_NEIME_Kennouche_2019 ENV_HV1B9_DuenasDecamp_2016 SCN5A_HUMAN_Glazer_2019 CALM1_HUMAN_Weile_2017 AICDA_HUMAN_Gajula_2014_3cycles ENVZ_ECOLI_Ghose_2023 CAS9_STRP1_Spencer_2017_positive B2L11_HUMAN_Dutta_2010_binding-Mcl-1 F7YBW8_MESOW_Aakre_2015 F7YBW7_MESOW_Ding_2023 TPK1_HUMAN_Weile_2017 ENV_HV1BR_Haddox_2016 REV_HV1H2_Fernandes_2016 A0A140D2T1_ZIKV_Sourisseau_2019 RAF1_HUMAN_Zinkus-Boltz_2019 TAT_HV1BR_Fernandes_2016 CCDB_ECOLI_Tripathi_2016 VKOR1_HUMAN_Chiasson_2020_activity CBS_HUMAN_Sun_2020 HSP82_YEAST_Cote-Hammarlof_2020_growth-H2O2 PA_I34A1_Wu_2015 KCNJ2_MOUSE_Coyote-Maestas_2022_function GLPA_HUMAN_Elazar_2016 ACE2_HUMAN_Chan_2020 CCR5_HUMAN_Gill_2023 KCNH2_HUMAN_Kozek_2020 Q6WV13_9MAXI_Somermeyer_2022 I6TAH8_I68A0_Doud_2015 BRCA2_HUMAN_Erwood_2022_HEK293T GFP_AEQVI_Sarkisyan_2016 A0A2Z5U3Z0_9INFA_Wu_2014 HEM3_HUMAN_Loggerenberg_2023 CD19_HUMAN_Klesmith_2019_FMC_singles UBE4B_MOUSE_Starita_2013 NCAP_I34A1_Doud_2015 KCNJ2_MOUSE_Coyote-Maestas_2022_surface MK01_HUMAN_Brenan_2016 TPOR_HUMAN_Bridgford_2020 HSP82_YEAST_Flynn_2019 GDIA_HUMAN_Silverstein_2021 OXDA_RHOTO_Vanella_2023_activity Q8WTC7_9CNID_Somermeyer_2022 A0A247D711_LISMN_Stadelmann_2021 A4D664_9INFA_Soh_2019 SHOC2_HUMAN_Kwon_2022 SRC_HUMAN_Chakraborty_2023_binding-DAS_25uM PAI1_HUMAN_Huttinger_2021 DYR_ECOLI_Thompson_2019 RDRP_I33A0_Li_2023 RPC1_LAMBD_Li_2019_high-expression DYR_ECOLI_Nguyen_2023 SPG1_STRSG_Olson_2014 MSH2_HUMAN_Jia_2020 SYUA_HUMAN_Newberry_2020 POLG_CXB3N_Mattenberger_2021 IF1_ECOLI_Kelsic_2016 CAR11_HUMAN_Meitlis_2020_gof AACC1_PSEAI_Dandage_2018 HXK4_HUMAN_Gersing_2023_abundance TRPC_THEMA_Chan_2017 RPC1_LAMBD_Li_2019_low-expression LYAM1_HUMAN_Elazar_2016 
KCNE1_HUMAN_Muhammad_2023_expression ADRB2_HUMAN_Jones_2020 SUMO1_HUMAN_Weile_2017 PTEN_HUMAN_Mighell_2018 ESTA_BACSU_Nutschel_2020 SRC_HUMAN_Nguyen_2022 BLAT_ECOLX_Deng_2012 POLG_HCVJF_Qi_2014 UBC9_HUMAN_Weile_2017 RL40A_YEAST_Roscoe_2014 HSP82_YEAST_Mishra_2016 Q837P5_ENTFA_Meier_2023 SERC_HUMAN_Xie_2023 RL40A_YEAST_Mavor_2016 TPMT_HUMAN_Matreyek_2018 RL20_AQUAE_Tsuboyama_2023_1GYZ OXDA_RHOTO_Vanella_2023_expression BRCA1_HUMAN_Findlay_2018 CAPSD_AAV2S_Sinai_2021 A0A192B1T2_9HIV1_Haddox_2018 Q2N0S5_9HIV1_Haddox_2018 MTHR_HUMAN_Weile_2021 GCN4_YEAST_Staller_2018 DLG4_RAT_McLaughlin_2012 Q837P4_ENTFA_Meier_2023 MLAC_ECOLI_MacRae_2023 MET_HUMAN_Estevam_2023 CCDB_ECOLI_Adkar_2012 R1AB_SARS2_Flynn_2022 Q53Z42_HUMAN_McShan_2019_binding-TAPBPR ILF3_HUMAN_Tsuboyama_2023_2L33 HXK4_HUMAN_Gersing_2022_activity SRC_HUMAN_Ahler_2019 ERBB2_HUMAN_Elazar_2016 RASK_HUMAN_Weng_2022_abundance POLG_DEN26_Suphatrakul_2023 A4_HUMAN_Seuma_2022 Q53Z42_HUMAN_McShan_2019_expression DLG4_HUMAN_Faure_2021 HMDH_HUMAN_Jiang_2019 PTEN_HUMAN_Matreyek_2021 PPM1D_HUMAN_Miller_2022 ANCSZ_Hobbs_2022 SC6A4_HUMAN_Young_2021 GAL4_YEAST_Kitzman_2015 O.4 Detailed results per DMS (contiguous) 0.00 0.25 0.50 0.75 1.00 TADBP_HUMAN_Bolognesi_2019 SRC_HUMAN_Ahler_2019 PKN1_HUMAN_Tsuboyama_2023_1URF PHOT_CHLRE_Chen_2023 CASP7_HUMAN_Roychowdhury_2020 RL20_AQUAE_Tsuboyama_2023_1GYZ A4_HUMAN_Seuma_2022 Q53Z42_HUMAN_McShan_2019_expression PPM1D_HUMAN_Miller_2022 NUD15_HUMAN_Suiter_2020 CAPSD_AAV2S_Sinai_2021 ANCSZ_Hobbs_2022 RNC_ECOLI_Weeks_2023 VKOR1_HUMAN_Chiasson_2020_abundance RASH_HUMAN_Bandaru_2017 RASK_HUMAN_Weng_2022_binding-DARPin_K55 Q59976_STRSQ_Romero_2015 D7PM05_CLYGR_Somermeyer_2022 FKBP3_HUMAN_Tsuboyama_2023_2KFV KCNE1_HUMAN_Muhammad_2023_function TRPC_SACS2_Chan_2017 CASP3_HUMAN_Roychowdhury_2020 RAD_ANTMA_Tsuboyama_2023_2CJJ S22A1_HUMAN_Yee_2023_activity SPIKE_SARS2_Starr_2020_binding SOX30_HUMAN_Tsuboyama_2023_7JJK RL40A_YEAST_Roscoe_2013 AMIE_PSEAE_Wrenbeck_2017 CAR11_HUMAN_Meitlis_2020_lof 
BCHB_CHLTE_Tsuboyama_2023_2KRU MTH3_HAEAE_RockahShmuel_2015 RFAH_ECOLI_Tsuboyama_2023_2LCL PSAE_SYNP2_Tsuboyama_2023_1PSE CP2C9_HUMAN_Amorosi_2021_abundance P53_HUMAN_Kotler_2018 POLG_PESV_Tsuboyama_2023_2MXD PABP_YEAST_Melamed_2013 S22A1_HUMAN_Yee_2023_abundance UBR5_HUMAN_Tsuboyama_2023_1I2T ODP2_GEOSE_Tsuboyama_2023_1W4G OPSD_HUMAN_Wan_2019 BLAT_ECOLX_Jacquier_2013 KKA2_KLEPN_Melnikov_2014 SBI_STAAM_Tsuboyama_2023_2JVG P53_HUMAN_Giacomelli_2018_WT_Nutlin PR40A_HUMAN_Tsuboyama_2023_1UZC RCRO_LAMBD_Tsuboyama_2023_1ORC PRKN_HUMAN_Clausen_2023 SAV1_MOUSE_Tsuboyama_2023_2YSB PITX2_HUMAN_Tsuboyama_2023_2L7M SCIN_STAAR_Tsuboyama_2023_2QFF ISDH_STAAW_Tsuboyama_2023_2LHR CP2C9_HUMAN_Amorosi_2021_activity CUE1_YEAST_Tsuboyama_2023_2MYX P84126_THETH_Chan_2017 NPC1_HUMAN_Erwood_2022_RPE1 THO1_YEAST_Tsuboyama_2023_2WQG DOCK1_MOUSE_Tsuboyama_2023_2M0Y OTU7A_HUMAN_Tsuboyama_2023_2L2D SPTN1_CHICK_Tsuboyama_2023_1TUD SPG1_STRSG_Wu_2016 SPIKE_SARS2_Starr_2020_expression DN7A_SACS2_Tsuboyama_2023_1JIC CATR_CHLRE_Tsuboyama_2023_2AMI RS15_GEOSE_Tsuboyama_2023_1A32 YAIA_ECOLI_Tsuboyama_2023_2KVT SRBS1_HUMAN_Tsuboyama_2023_2O2W NPC1_HUMAN_Erwood_2022_HEK293T SPG2_STRSG_Tsuboyama_2023_5UBS YNZC_BACSU_Tsuboyama_2023_2JVD GRB2_HUMAN_Faure_2021 MBD11_ARATH_Tsuboyama_2023_6ACV ARGR_ECOLI_Tsuboyama_2023_1AOY SPA_STAAU_Tsuboyama_2023_1LP1 OTC_HUMAN_Lo_2023 CBX4_HUMAN_Tsuboyama_2023_2K28 MYO3_YEAST_Tsuboyama_2023_2BTT BLAT_ECOLX_Stiffler_2015 CBPA2_HUMAN_Tsuboyama_2023_1O6X BBC1_YEAST_Tsuboyama_2023_1TG0 VILI_CHICK_Tsuboyama_2023_1YU5 NKX31_HUMAN_Tsuboyama_2023_2L9R HCP_LAMBD_Tsuboyama_2023_2L6Q HECD1_HUMAN_Tsuboyama_2023_3DKM UBE4B_HUMAN_Tsuboyama_2023_3L1X BLAT_ECOLX_Firnberg_2014 MAFG_MOUSE_Tsuboyama_2023_1K1V FECA_ECOLI_Tsuboyama_2023_2D1U A4GRB6_PSEAI_Chen_2020 CSN4_MOUSE_Tsuboyama_2023_1UFM PIN1_HUMAN_Tsuboyama_2023_1I6C OBSCN_HUMAN_Tsuboyama_2023_1V1C DNJA1_HUMAN_Tsuboyama_2023_2LO1 VRPI_BPT7_Tsuboyama_2023_2WNM EPHB2_HUMAN_Tsuboyama_2023_1F0M SDA_BACSU_Tsuboyama_2023_1PV0 
VG08_BPP22_Tsuboyama_2023_2GP8 RCD1_ARATH_Tsuboyama_2023_5OAO AMFR_HUMAN_Tsuboyama_2023_4G3O TNKS2_HUMAN_Tsuboyama_2023_5JRT TCRG1_MOUSE_Tsuboyama_2023_1E0L RD23A_HUMAN_Tsuboyama_2023_1IFY SR43C_ARATH_Tsuboyama_2023_2N88 PPARG_HUMAN_Majithia_2016 RBP1_HUMAN_Tsuboyama_2023_2KWH RPC1_BP434_Tsuboyama_2023_1R69 SQSTM_MOUSE_Tsuboyama_2023_2RRU NUSA_ECOLI_Tsuboyama_2023_1WCL NUSG_MYCTU_Tsuboyama_2023_2MI6 ProteinNPT Kermut (a) Top half 0.00 0.25 0.50 0.75 1.00 A0A1I9GEU1_NEIME_Kennouche_2019 ENV_HV1B9_DuenasDecamp_2016 HIS7_YEAST_Pokusaeva_2019 CALM1_HUMAN_Weile_2017 NRAM_I33A0_Jiang_2016 SCN5A_HUMAN_Glazer_2019 B2L11_HUMAN_Dutta_2010_binding-Mcl-1 ENVZ_ECOLI_Ghose_2023 PA_I34A1_Wu_2015 CAS9_STRP1_Spencer_2017_positive F7YBW7_MESOW_Ding_2023 AICDA_HUMAN_Gajula_2014_3cycles F7YBW8_MESOW_Aakre_2015 TPK1_HUMAN_Weile_2017 ENV_HV1BR_Haddox_2016 REV_HV1H2_Fernandes_2016 MSH2_HUMAN_Jia_2020 A4D664_9INFA_Soh_2019 A0A140D2T1_ZIKV_Sourisseau_2019 I6TAH8_I68A0_Doud_2015 GLPA_HUMAN_Elazar_2016 HSP82_YEAST_Cote-Hammarlof_2020_growth-H2O2 Q6WV13_9MAXI_Somermeyer_2022 VKOR1_HUMAN_Chiasson_2020_activity GFP_AEQVI_Sarkisyan_2016 ACE2_HUMAN_Chan_2020 NCAP_I34A1_Doud_2015 CBS_HUMAN_Sun_2020 RAF1_HUMAN_Zinkus-Boltz_2019 A0A2Z5U3Z0_9INFA_Wu_2014 MK01_HUMAN_Brenan_2016 KCNJ2_MOUSE_Coyote-Maestas_2022_function CCDB_ECOLI_Adkar_2012 CCR5_HUMAN_Gill_2023 CD19_HUMAN_Klesmith_2019_FMC_singles GDIA_HUMAN_Silverstein_2021 TPOR_HUMAN_Bridgford_2020 CCDB_ECOLI_Tripathi_2016 KCNJ2_MOUSE_Coyote-Maestas_2022_surface DYR_ECOLI_Thompson_2019 UBC9_HUMAN_Weile_2017 POLG_CXB3N_Mattenberger_2021 HEM3_HUMAN_Loggerenberg_2023 BLAT_ECOLX_Deng_2012 HSP82_YEAST_Flynn_2019 SYUA_HUMAN_Newberry_2020 BRCA2_HUMAN_Erwood_2022_HEK293T POLG_HCVJF_Qi_2014 RPC1_LAMBD_Li_2019_high-expression RDRP_I33A0_Li_2023 KCNH2_HUMAN_Kozek_2020 TAT_HV1BR_Fernandes_2016 Q2N0S5_9HIV1_Haddox_2018 UBE4B_MOUSE_Starita_2013 Q8WTC7_9CNID_Somermeyer_2022 SHOC2_HUMAN_Kwon_2022 TRPC_THEMA_Chan_2017 DYR_ECOLI_Nguyen_2023 PAI1_HUMAN_Huttinger_2021 
OXDA_RHOTO_Vanella_2023_activity PTEN_HUMAN_Mighell_2018 SRC_HUMAN_Chakraborty_2023_binding-DAS_25uM LYAM1_HUMAN_Elazar_2016 AACC1_PSEAI_Dandage_2018 A0A192B1T2_9HIV1_Haddox_2018 A0A247D711_LISMN_Stadelmann_2021 CAR11_HUMAN_Meitlis_2020_gof IF1_ECOLI_Kelsic_2016 SPG1_STRSG_Olson_2014 HXK4_HUMAN_Gersing_2023_abundance DLG4_RAT_McLaughlin_2012 P53_HUMAN_Giacomelli_2018_Null_Etoposide Q837P4_ENTFA_Meier_2023 ADRB2_HUMAN_Jones_2020 SRC_HUMAN_Nguyen_2022 BRCA1_HUMAN_Findlay_2018 ERBB2_HUMAN_Elazar_2016 PTEN_HUMAN_Matreyek_2021 R1AB_SARS2_Flynn_2022 ESTA_BACSU_Nutschel_2020 GCN4_YEAST_Staller_2018 SUMO1_HUMAN_Weile_2017 RL40A_YEAST_Roscoe_2014 OXDA_RHOTO_Vanella_2023_expression ILF3_HUMAN_Tsuboyama_2023_2L33 KCNE1_HUMAN_Muhammad_2023_expression SC6A4_HUMAN_Young_2021 RPC1_LAMBD_Li_2019_low-expression MTHR_HUMAN_Weile_2021 SERC_HUMAN_Xie_2023 RL40A_YEAST_Mavor_2016 DLG4_HUMAN_Faure_2021 LGK_LIPST_Klesmith_2015 Q837P5_ENTFA_Meier_2023 TPMT_HUMAN_Matreyek_2018 MLAC_ECOLI_MacRae_2023 POLG_DEN26_Suphatrakul_2023 Q53Z42_HUMAN_McShan_2019_binding-TAPBPR HSP82_YEAST_Mishra_2016 A0A2Z5U3Z0_9INFA_Doud_2016 MET_HUMAN_Estevam_2023 C6KNH7_9INFA_Lee_2018 GAL4_YEAST_Kitzman_2015 RASK_HUMAN_Weng_2022_abundance HXK4_HUMAN_Gersing_2022_activity YAP1_HUMAN_Araya_2012 P53_HUMAN_Giacomelli_2018_Null_Nutlin HMDH_HUMAN_Jiang_2019

Section: F Ablation results
In Section 4.1, an ablation study was carried out by removing components of Kermut. Table 2 showed the resulting performance difference in Spearman correlation; Table F.1 shows the corresponding difference in MSE. The aggregated absolute Spearman and MSE values are shown in Tables F.2 and F.3, while the performance per functional category is shown in Tables F.4 and F.5. These results suggest that an even wider combinatorial examination of Kermut's kernel composition might lead to slightly increased performance, e.g., by multiplying a Matérn 5/2 sequence kernel with the structure kernel. We must, however, note that the standard errors in Tables F.2 and F.3 suggest that while the product and Matérn configurations lead to better results on the ablation datasets, the differences are not significant.
In addition to the shown methods, we considered sequence-only kernels as baselines. One example is the inverse multiquadric Hamming (IMQ-H) kernel from [31]. This kernel, however, relies on the Hamming distance between one-hot encoded sequences for its covariances. For ProteinGym's single-mutant benchmark, this is not sufficient as the Hamming distance between all variants is 2, resulting in constant predictions for all folds and subsequent Spearman correlations narrowly centered on 0.
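The degeneracy of Hamming-based kernels on single mutants can be checked with a small sketch. The wild-type string `MKTAY` and the helper names are hypothetical and chosen only for illustration, and distances are computed on amino-acid strings rather than one-hot vectors (an assumption; one-hot encoding doubles each distance without changing the pattern). Any two single mutants at different sites are at distance 2, and two different substitutions at the same site are at distance 1, so the distance carries almost no information about which variants are functionally similar.

```python
from itertools import combinations

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def single_mutants(wild_type: str) -> list[str]:
    """All single-substitution variants of the wild type."""
    variants = []
    for pos, wt_aa in enumerate(wild_type):
        for aa in AMINO_ACIDS:
            if aa != wt_aa:
                variants.append(wild_type[:pos] + aa + wild_type[pos + 1:])
    return variants


def pairwise_distance_counts(wild_type: str) -> dict[int, int]:
    """Histogram of Hamming distances over all pairs of single mutants."""
    counts: dict[int, int] = {}
    for a, b in combinations(single_mutants(wild_type), 2):
        d = hamming(a, b)
        counts[d] = counts.get(d, 0) + 1
    return counts


if __name__ == "__main__":
    # Hypothetical toy wild type; only distances 1 and 2 ever occur.
    print(pairwise_distance_counts("MKTAY"))
```

Because the share of same-site pairs shrinks as the sequence gets longer, a kernel that depends only on this distance is close to constant across a realistic single-mutant dataset.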

Section: N Hyperparameter visualization
N.1 Hyperparameter distributions
The distributions of Kermut's hyperparameters across a number of datasets from ProteinGym can be seen divided by split scheme in Figure N.1. λ is a scale parameter for the structure kernel, while π is a balancing parameter which lets the model assign importance to the structure and sequence kernels, respectively. γ₁, γ₂, and γ₃ are scale coefficients in the kernels' exponents. Their inverses are shown to facilitate easier comparison with the squared exponential kernel's lengthscale, l_SE.
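The roles of these hyperparameters can be illustrated with a minimal scalar sketch. The functional form below is an assumption made for illustration, not the definition used in the paper: the structure term is taken to be λ·exp(−γ·d) for a single generic distance d, the sequence term is a squared exponential kernel, and π mixes the two as a convex combination. All function and argument names are hypothetical.

```python
import math


def structure_term(distance: float, lam: float, gamma: float) -> float:
    """Assumed structure kernel: lam scales it, gamma sits in the exponent.

    Since exp(-gamma * d) == exp(-d / (1 / gamma)), the inverse 1 / gamma
    plays a role comparable to a lengthscale.
    """
    return lam * math.exp(-gamma * distance)


def squared_exponential(sq_distance: float, lengthscale: float) -> float:
    """Squared exponential kernel evaluated on a squared distance."""
    return math.exp(-sq_distance / (2.0 * lengthscale ** 2))


def composite_kernel(
    struct_distance: float,
    seq_sq_distance: float,
    lam: float,
    pi: float,
    gamma: float,
    lengthscale: float,
) -> float:
    """Assumed convex combination of structure and sequence terms."""
    if not 0.0 <= pi <= 1.0:
        raise ValueError("pi must lie in [0, 1]")
    structure = structure_term(struct_distance, lam, gamma)
    sequence = squared_exponential(seq_sq_distance, lengthscale)
    return pi * structure + (1.0 - pi) * sequence


if __name__ == "__main__":
    # pi close to 1 lets the structure term dominate; pi close to 0 the sequence term.
    for pi in (0.0, 0.5, 1.0):
        print(pi, composite_kernel(1.0, 4.0, lam=2.0, pi=pi, gamma=0.5, lengthscale=2.0))
```

Under this assumed form, a large 1/γ means the structure term decays slowly with distance, which is why plotting the inverses makes them directly comparable to l_SE.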

Section: P Ethics
We have introduced a general framework to predict variant effects given labeled data. The intent of our work is to use the framework to model and subsequently optimize proteins. We acknowledge that, in principle, any protein property can be modeled (depending on the available data), which means that potentially harmful proteins can be engineered using our method. We encourage the community to use our proposed method for beneficial purposes only, such as the engineering of efficient enzymes or the characterization of potentially pathogenic variants for the betterment of biological interpretation and clinical treatment.

Section: NeurIPS Paper Checklist
1. Claims
Question: Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope?
Answer: [Yes]
Justification: The contributions are listed in the introduction (Section 1) and match the results (Section 4) as described in the discussion (Section 5).
Guidelines:
• The answer NA means that the abstract and introduction do not include the claims made in the paper.
• The abstract and/or introduction should clearly state the claims made, including the contributions made in the paper and important assumptions and limitations. A No or NA answer to this question will not be perceived well by the reviewers.
• The claims made should match theoretical and experimental results, and reflect how much the results can be expected to generalize to other settings.
• It is fine to include aspirational goals as motivation as long as it is clear that these goals are not attained by the paper.

Section: Limitations
Question: Does the paper discuss the limitations of the work performed by the authors?
Answer: [Yes]
Justification: Limitations are outlined in the discussion (Section 5).
Guidelines:
• The answer NA means that the paper has no limitation while the answer No means that the paper has limitations, but those are not discussed in the paper.
• The authors are encouraged to create a separate "Limitations" section in their paper.
• The paper should point out any strong assumptions and how robust the results are to violations of these assumptions (e.g., independence assumptions, noiseless settings, model well-specification, asymptotic approximations only holding locally). The authors should reflect on how these assumptions might be violated in practice and what the implications would be.
• The authors should reflect on the scope of the claims made, e.g., if the approach was only tested on a few datasets or with a few runs. In general, empirical results often depend on implicit assumptions, which should be articulated.
• The authors should reflect on the factors that influence the performance of the approach. For example, a facial recognition algorithm may perform poorly when image resolution is low or images are taken in low lighting. Or a speech-to-text system might not be used reliably to provide closed captions for online lectures because it fails to handle technical jargon.
• The authors should discuss the computational efficiency of the proposed algorithms and how they scale with dataset size.
• If applicable, the authors should discuss possible limitations of their approach to address problems of privacy and fairness.
• While the authors might fear that complete honesty about limitations might be used by reviewers as grounds for rejection, a worse outcome might be that reviewers discover limitations that aren't acknowledged in the paper. The authors should use their best judgment and recognize that individual actions in favor of transparency play an important role in developing norms that preserve the integrity of the community. Reviewers will be specifically instructed to not penalize honesty concerning limitations.

Section: Theory Assumptions and Proofs
Question: For each theoretical result, does the paper provide the full set of assumptions and a complete (and correct) proof?
Answer: [Yes]
Justification: See Section 3.3 for theoretical results and Appendix B for further details and proofs.
Guidelines:
• The answer NA means that the paper does not include theoretical results.
• All the theorems, formulas, and proofs in the paper should be numbered and cross-referenced.
• All assumptions should be clearly stated or referenced in the statement of any theorems.
• The proofs can either appear in the main paper or the supplemental material, but if they appear in the supplemental material, the authors are encouraged to provide a short proof sketch to provide intuition.
• Inversely, any informal proof provided in the core of the paper should be complemented by formal proofs provided in appendix or supplemental material.
• Theorems and Lemmas that the proof relies upon should be properly referenced.

Section: Experimental Result Reproducibility
Question: Does the paper fully disclose all the information needed to reproduce the main experimental results of the paper to the extent that it affects the main claims and/or conclusions of the paper (regardless of whether the code and data are provided or not)?
Answer: [Yes]
Justification: Implementation details (both software and data) can be found in Appendix C.
Guidelines:
• The answer NA means that the paper does not include experiments.
• If the paper includes experiments, a No answer to this question will not be perceived well by the reviewers: Making the paper reproducible is important, regardless of whether the code and data are provided or not.
• If the contribution is a dataset and/or model, the authors should describe the steps taken to make their results reproducible or verifiable.
• Depending on the contribution, reproducibility can be accomplished in various ways. For example, if the contribution is a novel architecture, describing the architecture fully might suffice, or if the contribution is a specific model and empirical evaluation, it may be necessary to either make it possible for others to replicate the model with the same dataset, or provide access to the model. In general, releasing code and data is often one good way to accomplish this, but reproducibility can also be provided via detailed instructions for how to replicate the results, access to a hosted model (e.g., in the case of a large language model), releasing of a model checkpoint, or other means that are appropriate to the research performed.
• While NeurIPS does not require releasing code, the conference does require all submissions to provide some reasonable avenue for reproducibility, which may depend on the nature of the contribution.
… with an open-source dataset or instructions for how to construct the dataset).
(d) We recognize that reproducibility may be tricky in some cases, in which case authors are welcome to describe the particular way they provide for reproducibility. In the case of closed-source models, it may be that access to the model is limited in some way (e.g., to registered users), but it should be possible for other researchers to have some path to reproducing or verifying the results.

Section: Open access to data and code
Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material?
Answer: [Yes]
Justification: The full codebase can be found in an attached zip archive, while a public GitHub repository will be made available when anonymity is no longer an issue. All data is extracted directly from the ProteinGym repository, according to their instructions.
Guidelines:
• The answer NA means that the paper does not include experiments requiring code.
• Providing as much information as possible in supplemental material (appended to the paper) is recommended, but including URLs to data and code is permitted.

Section: Experimental Setting/Details
Question: Does the paper specify all the training and test details (e.g., data splits, hyperparameters, how they were chosen, type of optimizer, etc.) necessary to understand the results?
Answer: [Yes]
Justification: All experimental details are described in Appendix C.
Guidelines:
• The answer NA means that the paper does not include experiments.
• The experimental setting should be presented in the core of the paper to a level of detail that is necessary to appreciate the results and make sense of them.
• The full details can be provided either with the code, in appendix, or as supplemental material.

Section: Experiment Statistical Significance
Question: Does the paper report error bars suitably and correctly defined or other appropriate information about the statistical significance of the experiments?
Answer: [Yes]
Justification: The figures containing error bars explicitly state which type of error bar is shown. For the main results, the non-parametric bootstrap error is included in Table E.2.
Guidelines:
• The answer NA means that the paper does not include experiments.
• The authors should answer "Yes" if the results are accompanied by error bars, confidence intervals, or statistical significance tests, at least for the experiments that support the main claims of the paper.
• The factors of variability that the error bars are capturing should be clearly stated (for example, train/test split, initialization, random drawing of some parameter, or overall run with given experimental conditions).
• The method for calculating the error bars should be explained (closed form formula, call to a library function, bootstrap, etc.).
• The assumptions made should be given (e.g., Normally distributed errors).
• It should be clear whether the error bar is the standard deviation or the standard error of the mean.
• We recognize that the procedures for this may vary significantly between institutions and locations, and we expect authors to adhere to the NeurIPS Code of Ethics and the guidelines for their institution.
• For initial submissions, do not include any information that would break anonymity (if applicable), such as the institution conducting the review.


References:
[b0] Adam J Riesselman; John B Ingraham; Debora S Marks (2018-10). Deep generative models of genetic variation capture the effects of mutations. Nature Methods
[b1] Alexander Rives; Joshua Meier; Tom Sercu; Siddharth Goyal; Zeming Lin; Jason Liu; Demi Guo; Myle Ott; C Lawrence Zitnick; Jerry Ma (2021). Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences. Proceedings of the National Academy of Sciences
[b2] Jun Cheng; Guido Novati; Joshua Pan; Clare Bycroft; Akvilė Žemgulytė; Taylor Applebaum; Alexander Pritzel; Lai Hong Wong; Michal Zielinski; Tobias Sargeant; Rosalia G Schneider; Andrew W Senior; John Jumper; Demis Hassabis; Pushmeet Kohli; Žiga Avsec (2023-09). Accurate proteome-wide missense variant effect prediction with AlphaMissense. Science
[b3] Kotaro Tsuboyama; Justas Dauparas; Jonathan Chen; Elodie Laine; Yasser Mohseni Behbahani; Jonathan J Weinstein; Niall M Mangan; Sergey Ovchinnikov; Gabriel J Rocklin (2023-07). Megascale experimental analysis of protein folding stability in biology and design. Nature
[b4] Mihaly Varadi; Damian Bertoni; Paulyna Magana; Urmila Paramval; Ivanna Pidruchna; Malarvizhi Radhakrishnan; Maxim Tsenkov; Sreenath Nair; Milot Mirdita; Jingi Yeo; Oleg Kovalevskiy; Kathryn Tunyasuvunakool; Agata Laydon; Augustin Žídek; Hamish Tomlinson; Dhavanthi Hariharan; Josh Abrahamson; Tim Green; John Jumper; Ewan Birney; Martin Steinegger; Demis Hassabis; Sameer Velankar (2023-11). AlphaFold Protein Structure Database in 2024: Providing Structure Coverage for over 214 Million Protein Sequences. Nucleic Acids Research
[b5] Christian Dallago; Jody Mou; Kadina E Johnston; Bruce Wittmann; Nick Bhattacharya; Samuel Goldman; Ali Madani; Kevin K Yang (2021). FLIP: Benchmark tasks in fitness landscape inference for proteins. 
[b6] Pascal Notin; Aaron W Kollasch; Daniel Ritter; Lood Van Niekerk; Steffan Paul; Han Spinner; Nathan J Rollins; Ada Shaw; Rose Orenbuch; Ruben Weitzman; Jonathan Frazer; Mafalda Dias; Dinko Franceschi; Yarin Gal; Debora Susan Marks (2023-11). ProteinGym: Large-Scale Benchmarks for Protein Fitness Prediction and Design. 
[b7] Jonathan Foldager; Mikkel Jordahn; Lars Kai Hansen; Michael Riis Andersen (2023). On the Role of Model Uncertainties in Bayesian Optimization. 
[b8] Jonathan Frazer; Pascal Notin; Mafalda Dias; Aidan Gomez; Joseph K Min; Kelly Brock; Yarin Gal; Debora S Marks (2021-11). Disease variant prediction with deep generative models of evolutionary data. Nature
[b9] Joshua Meier; Roshan Rao; Robert Verkuil; Jason Liu; Tom Sercu; Alex Rives (2021). Language models enable zero-shot prediction of the effects of mutations on protein function. Advances in neural information processing systems
[b10] Pascal Notin; Mafalda Dias; Jonathan Frazer; Javier Marchena-Hurtado; Aidan N Gomez; Debora Marks; Yarin Gal (2022). Tranception: protein fitness prediction with autoregressive transformers and inference-time retrieval. PMLR
[b11] Kevin K Yang; Zachary Wu; Frances H Arnold (2019-08). Machine-learning-guided directed evolution for protein engineering. Nature Methods
[b12] Ehsaneddin Asgari; Mohammad R K Mofrad (2015). Continuous Distributed Representation of Biological Sequences for Deep Proteomics and Genomics. PLoS ONE
[b13] Kevin K Yang; Zachary Wu; Claire N Bedbrook; Frances H Arnold (2018-08). Learned protein embeddings for machine learning. Bioinformatics
[b14] Ahmed Elnaggar; Michael Heinzinger; Christian Dallago; Ghalia Rehawi; Yu Wang; Llion Jones; Tom Gibbs; Tamas Feher; Christoph Angerer; Martin Steinegger (2021). ProtTrans: Toward Understanding the Language of Life Through Self-Supervised Learning. IEEE transactions on pattern analysis and machine intelligence
[b15] Zeming Lin; Halil Akin; Roshan Rao; Brian Hie; Zhongkai Zhu; Wenting Lu; Nikita Smetanin; Robert Verkuil; Ori Kabeli; Yaniv Shmueli (2023). Evolutionary-scale prediction of atomic-level protein structure with a language model. Science
[b16] Jin Su; Chenchen Han; Yuyang Zhou; Junjie Shan; Xibin Zhou; Fajie Yuan (2024). SaProt: Protein Language Modeling with Structure-aware Vocabulary. 
[b17] Chloe Hsu; Hunter Nisonoff; Clara Fannjiang; Jennifer Listgarten (2022-01). Learning protein fitness models from evolutionary and assay-labeled data. Nature Biotechnology
[b18] Pascal Notin; Ruben Weitzman; Debora Marks; Yarin Gal (2023). ProteinNPT: Improving Protein Property Prediction and Design with Non-Parametric Transformers. Curran Associates, Inc.
[b20] Roshan M Rao; Jason Liu; Robert Verkuil; Joshua Meier; John Canny; Pieter Abbeel; Tom Sercu; Alexander Rives (2021-07). MSA Transformer. PMLR
[b21] Roshan M Rao; Nicholas Bhattacharya; Neil Thomas; Yan Duan; Peter Chen; John Canny; Pieter Abbeel; Yun Song (2019). Evaluating protein transfer learning with TAPE. Advances in neural information processing systems
[b22] Minghao Xu; Zuobai Zhang; Jiarui Lu; Zhaocheng Zhu; Yangtian Zhang; Ma Chang; Runcheng Liu; Jian Tang (2022). PEER: A Comprehensive and Multi-Task Benchmark for Protein Sequence Understanding. Advances in Neural Information Processing Systems
[b23] Raphael John Lamarre Townshend; Martin Vögele; Patricia Adriana Suriana; Alexander Derry; Alexander Powers; Yianni Laloudakis; Sidhika Balachandar; Bowen Jing; Brandon M Anderson; Stephan Eismann; Risi Kondor; Russ Altman; Ron O Dror (2021). ATOM3D: Tasks On Molecules in Three Dimensions. 
[b24] Peter Mørch Groth; Richard Michael; Jesper Salomon; Pengfei Tian; Wouter Boomsma (2023-06). FLOP: Tasks for Fitness Landscapes Of Protein wildtypes. 
[b25] Christina Leslie; Eleazar Eskin; William Stafford Noble (2001-12). The spectrum kernel: A string kernel for SVM protein classification. World Scientific
[b26] Christina S Leslie; Eleazar Eskin; Adiel Cohen; Jason Weston; William Stafford Noble (2004-03). Mismatch String Kernels for Discriminative Protein Classification. Bioinformatics
[b27] Henry Moss; David Leslie; Daniel Beck; Javier Gonzalez; Paul Rayson (2020). BOSS: Bayesian Optimization over String Spaces. Advances in neural information processing systems
[b28] Nora C Toussaint; Christian Widmer; Oliver Kohlbacher; Gunnar Rätsch (2010). Exploiting physico-chemical properties in string kernels. BMC bioinformatics
[b29] Philip A Romero; Andreas Krause; Frances H Arnold (2013-01). Navigating the protein fitness landscape with Gaussian processes. Proceedings of the National Academy of Sciences
[b30] Jonathan C Greenhalgh; Sarah A Fahlberg; Brian F Pfleger; Philip A Romero (2021). Machine learning-guided acyl-ACP reductase engineering for improved in vivo fatty alcohol production. Nature communications
[b31] Alan Nawzad Amin; Eli Nathan Weinstein; Debora Susan Marks (2023). Biological sequence kernels with guaranteed flexibility. 
[b32] Emmi Jokinen; Markus Heinonen; Harri Lähdesmäki (2018-07). mGPfusion: Predicting protein stability changes with Gaussian process kernel learning and data fusion. Bioinformatics
[b33] Andrew Leaver-Fay; Michael Tyka; Steven M Lewis; Oliver F Lange; James Thompson; Ron Jacak; Kristian W Kaufman; P Douglas Renfrew; Colin A Smith; Will Sheffler (2011). ROSETTA3: an object-oriented software suite for the simulation and design of macromolecules. Methods in enzymology
[b34] Jonathan Parkinson; Wei Wang (2023). Linear-scaling kernels for protein sequences and small molecules outperform deep learning while providing uncertainty quantitation and improved interpretability. Journal of Chemical Information and Modeling
[b35] Haoyang Zeng; David K Gifford (2019). Quantification of Uncertainty in Peptide-MHC Binding Prediction Improves High-Affinity Peptide Selection for Therapeutic Design. Cell systems
[b36] Brian Hie; Bryan D Bryson; Bonnie Berger (2020-11). Leveraging Uncertainty in Machine Learning Accelerates Biological Discovery and Design. Cell Systems
[b37] Hunter Nisonoff; Yixin Wang; Jennifer Listgarten (2023). Coherent Blending of Biophysics-Based Knowledge with Bayesian Neural Networks for Robust Protein Property Prediction. ACS Synthetic Biology
[b38] Young Su Ko; Jonathan Parkinson; Cong Liu; Wei Wang (2024). TUnA: an uncertainty-aware transformer model for sequence-based protein-protein interaction prediction. Briefings in Bioinformatics
[b39] Jeremiah Liu; Zi Lin; Shreyas Padhy; Dustin Tran; Tania Bedrax Weiss; Balaji Lakshminarayanan (2020). Simple and Principled Uncertainty Estimation with Deterministic Deep Learning via Distance Awareness. Advances in neural information processing systems
[b40] Fredrik K Gustafsson; Martin Danelljan; Thomas B Schön (2020). Evaluating Scalable Bayesian Deep Learning Methods for Robust Computer Vision. 
[b41] Gabriele Scalia; Colin A Grambow; Barbara Pernici; Yi-Pei Li; William H Green (2020). Evaluating Scalable Uncertainty Estimation Methods for Deep Learning-Based Molecular Property Prediction. ACS
[b42] Dan Levi; Liran Gispan; Niv Giladi; Ethan Fetaya (2022). Evaluating and Calibrating Uncertainty Prediction in Regression Tasks. Sensors
[b43] Lior Hirschfeld; Kyle Swanson; Kevin Yang; Regina Barzilay; Connor W Coley (2020). Uncertainty Quantification Using Neural Networks for Molecular Property Prediction. Journal of Chemical Information and Modeling
[b44] Kevin Tran; Willie Neiswanger; Junwoong Yoon; Qingyang Zhang; Eric Xing; Zachary W Ulissi (2020). Methods for comparing uncertainty quantifications for material property predictions. Machine Learning: Science and Technology
[b45] Kevin P Greenman; Ava P Amini; Kevin K Yang (2023). Benchmarking uncertainty quantification for protein engineering. bioRxiv
[b46] Yinghao Li; Lingkai Kong; Yuanqi Du; Yue Yu; Yuchen Zhuang; Wenhao Mu; Chao Zhang (2024). MUBen: Benchmarking the uncertainty of molecular representation models. Transactions on Machine Learning Research
[b47] Stephan Thaler; Felix Mayr; Siby Thomas; Alessio Gagliardi; Julija Zavadlav (2024). Active learning graph neural networks for partial charge prediction of metal-organic frameworks via dropout Monte Carlo. npj Computational Materials
[b48] Yarin Gal; Zoubin Ghahramani (2016-06). Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning. PMLR
[b49] Richard Michael; Jacob Kaestel-Hansen; Peter Mørch Groth; Simon Bartels; Jesper Salomon; Pengfei Tian; Nikos S Hatzakis; Wouter Boomsma (2024-05). A Systematic Analysis of Regression Models for Protein Engineering. PLOS Computational Biology
[b50] John Ingraham; Vikas Garg; Regina Barzilay; Tommi Jaakkola (2019). Generative models for graph-based protein design. Advances in neural information processing systems
[b51] Chloe Hsu; Robert Verkuil; Jason Liu; Zeming Lin; Brian Hie; Tom Sercu; Adam Lerer; Alexander Rives (2022-06). Learning Inverse Folding from Millions of Predicted Structures. PMLR
[b52] Justas Dauparas; Ivan Anishchenko; Nathaniel Bennett; Hua Bai; Robert J Ragotte; Lukas F Milles; Basile I M Wicky; Alexis Courbet; Rob J de Haas; Neville Bethel (2022). Robust deep learning-based protein sequence design using ProteinMPNN. Science
[b53] Zhangyang Gao; Cheng Tan; Stan Z Li (2023). PiFold: Toward effective and efficient protein inverse folding. 
[b54] Zhangyang Gao; Cheng Tan; Xingran Chen; Yijie Zhang; Jun Xia; Siyuan Li; Stan Z Li (2024). KW-Design: Pushing the Limit of Protein Design via Knowledge Refinement. 
[b55] Xinyi Zhou; Guangyong Chen; Junjie Ye; Ercheng Wang; Jun Zhang; Cong Mao; Zhanwei Li; Jianye Hao; Xingxu Huang; Jin Tang; Pheng-Ann Heng (2023). ProRefiner: an entropy-based refining strategy for inverse protein folding with global graph attention. Nature Communications
[b56] Milong Ren; Chungong Yu; Dongbo Bu; Haicang Zhang (2024). Accurate and robust protein sequence design with CarbonDesign. Nature Machine Intelligence
[b57] Wen Torng; Russ B Altman (2017). 3D deep convolutional neural networks for amino acid environment similarity analysis. BMC bioinformatics
[b58] Pablo Gainza; Freyr Sverrisson; Federico Monti; Emanuele Rodola; Davide Boscaini; Michael M Bronstein; Bruno E Correia (2020). Deciphering interaction fingerprints from protein molecular surfaces using geometric deep learning. Nature Methods
[b59] Pablo Gainza; Sarah Wehrle; Alexandra Van Hall-Beauvais; Anthony Marchand; Andreas Scheck; Zander Harteveld; Stephen Buckley; Dongchun Ni; Shuguang Tan; Freyr Sverrisson; Casper Goverde; Priscilla Turelli; Charlène Raclot; Alexandra Teslenko; Martin Pacesa; Stéphane Rosset; Sandrine Georgeon; Jane Marsden; Aaron Petruzzella; Kefang Liu; Zepeng Xu; Yan Chai; Pu Han; George F Gao; Elisa Oricchio; Beat Fierz; Didier Trono; Henning Stahlberg; Michael Bronstein; Bruno E Correia (2023-05). De novo design of protein interactions with learned surface fingerprints. Nature
[b60] Raghav Shroff; Austin W Cole; Daniel J Diaz; Barrett R Morrow; Isaac Donnell; Ankur Annapareddy; Jimmy Gollihar; Andrew D Ellington; Ross Thyer (2020). Discovery of Novel Gain-of-Function Mutations Guided by Structure-Based Deep Learning. ACS synthetic biology
[b61] Anastasiya V Kulikova; Daniel J Diaz; James M Loy; Andrew D Ellington; Claus O Wilke (2021). Learning the local landscape of protein structures with convolutional neural networks. Journal of Biological Physics
[b62] Zsolt Fazekas; Dóra K Menyhárd; András Perczel (2024). LoCoHD: a metric for comparing local environments of proteins. Nature Communications
[b63] David Ding; Ada Y Shaw; Sam Sinai; Nathan Rollins; Noam Prywes; David F Savage; Michael T Laub; Debora S Marks (2024-02). Protein design using structure-based residue preferences. Nature Communications
[b64] John Jumper; Richard Evans; Alexander Pritzel; Tim Green; Michael Figurnov; Olaf Ronneberger; Kathryn Tunyasuvunakool; Russ Bates; Augustin Žídek; Anna Potapenko; Alex Bridgland; Clemens Meyer; Simon A A Kohl; Andrew J Ballard; Andrew Cowie; Bernardino Romera-Paredes; Stanislav Nikolov; Rishub Jain; Jonas Adler; Trevor Back; Stig Petersen; David Reiman; Ellen Clancy; Michal Zielinski; Martin Steinegger; Michalina Pacholska; Tamas Berghammer; Sebastian Bodenstein; David Silver; Oriol Vinyals; Andrew W Senior; Koray Kavukcuoglu; Pushmeet Kohli; Demis Hassabis (2021-08). Highly accurate protein structure prediction with AlphaFold. Nature
[b65] Carl Edward Rasmussen; Christopher K I Williams (2006). Gaussian Processes for Machine Learning. MIT Press
[b66] Richard Michael; Simon Bartels; Miguel González-Duque; Yevgen Zainchkovskyy; Jes Frellsen; Søren Hauberg; Wouter Boomsma (2024). A Continuous Relaxation for Discrete Bayesian Optimization. 
[b67] Brian Hie; Bryan D Bryson; Bonnie Berger (2020). Leveraging Uncertainty in Machine Learning Accelerates Biological Discovery and Design. Cell systems
[b68] Morris H DeGroot; Stephen E Fienberg (1983). The Comparison and Evaluation of Forecasters. Journal of the Royal Statistical Society. Series D (The Statistician)
[b69] Michael A Stiffler; Doeke R Hekstra; Rama Ranganathan (2015-02). Evolvability as a Function of Purifying Selection in TEM-1 β-Lactamase. Cell
[b70] Nicholas C Wu; C Anders Olson; Yushen Du; Shuai Le; Kevin Tran; Roland Remenyi; Danyang Gong; Laith Q Al-Mawsawi; Hangfei Qi; Ting-Ting Wu; Ren Sun (2015-07). Functional Constraint Profiling of a Viral Protein Reveals Discordance of Evolutionary Conservation and Functionality. PLOS Genetics
[b71] Aliete Wan; Emily Place; Eric A Pierce; Jason Comander (2019-08). Characterizing Variants of Unknown Significance in Rhodopsin: A Functional Genomics Approach. Human Mutation
[b72] Edward Snelson; Zoubin Ghahramani (2005). Sparse Gaussian Processes using Pseudo-inputs. Advances in neural information processing systems
[b73] Ke Wang; Geoff Pleiss; Jacob Gardner; Stephen Tyree; Kilian Q Weinberger; Andrew Gordon Wilson (2019). Exact Gaussian Processes on a Million Data Points. Advances in Neural Information Processing Systems
[b74] Giacomo Meanti; Luigi Carratino; Lorenzo Rosasco; Alessandro Rudi (2020). Kernel Methods Through the Roof: Handling Billions of Points Efficiently. Advances in Neural Information Processing Systems. Curran Associates, Inc
[b76] Imre Csiszár; Paul C Shields (2004). Information theory and statistics: A tutorial. Foundations and Trends® in Communications and Information Theory
[b77] Nicki Skafte Detlefsen; Søren Hauberg; Wouter Boomsma (2022-04). Learning meaningful representations of protein sequences. Nature Communications
[b78] Thomas Gärtner; Peter A Flach; Adam Kowalczyk; Alex J Smola (2002). Multi-instance kernels. Morgan Kaufmann Publishers Inc
[b79] David Duvenaud (2014). Automatic Model Construction with Gaussian Processes. 
[b80] Jacob Gardner; Geoff Pleiss; Kilian Q Weinberger; David Bindel; Andrew G Wilson (2018). GPyTorch: Blackbox Matrix-Matrix Gaussian Process Inference with GPU Acceleration. Advances in Neural Information Processing Systems
[b81] Nicholas G Polson; James G Scott (2012). On the Half-Cauchy Prior for a Global Scale Parameter. Bayesian Analysis
[b82] Ilya Loshchilov; Frank Hutter (2019). Decoupled Weight Decay Regularization. 

Figures:
Figure fig_0: 1
Type: figure
Caption: Figure 1: Overview of Kermut's structure kernel. Using an inverse folding model, structure-conditioned amino acid distributions are computed for all sites in the reference protein. The structure kernel yields high covariances between two variants if the local environments are similar, if the mutation probabilities are similar, and if the mutated sites are physically close. Constructed examples of expected covariances between variant x_1 and x_{2,3,4} are shown.
Data: 

Figure fig_1: 2
Type: figure
Caption: Figure 2: Distribution of predictive variances for datasets with double mutants, grouped by domain. The first three elements correspond to the three split schemes from ProteinGym. The fourth and fifth correspond to training on both single and double mutants, and testing on each, respectively. For the last column, we train on single and test on double mutants, corresponding to an extrapolation setting.
Data: 

Figure fig_2: 
Type: figure
Caption: Figure 2 shows the distributions of mean predictive variances in the six domains. In the three single-mutant domains, the uncertainties increase from scheme to scheme, reflecting the increasing difficulty of the tasks and, analogously, the expected performance scores (Table 1). When training on both single and double mutants (1M/2M), we observe a lower uncertainty on double mutants than on single mutants. For many of the multi-mutant datasets, the mutants are not uniformly sampled but often include a fixed single mutation. A possible explanation is thus that it might be more challenging to decouple the signal from a double mutation into its constituent single-mutation signals. In the extrapolation setting, we observe large predictive uncertainties, as expected. One explanation of the discrepancy between the variance distributions in the multi-mutant domains might lie in the difference in target distributions between training and test sets. Figure I.1 in the appendix shows the overall target distribution of assays for the 51 considered multi-mutant datasets. The single and double mutants generally belong to different modalities, where the double mutants often lead to a loss of fitness. This shows the difficulty of predicting on domains not encountered during training. For reference, we include the results for the multi-mutant domains in Table G.1 in the appendix.
Data: 

Figure fig_3: 
Type: figure
Caption: (b) Error-based calibration curves.
Data: 

Figure fig_4: 3
Type: figure
Caption: Figure 3: Calibration curves for Kermut using different methods. Mean ECE/ENCE values (±2σ) are shown. The dashed line (x = y) corresponds to ideal calibration. The row order corresponds to the ordering in Table 3. (a) exhibits good calibration, as indicated by curves close to the diagonal and ECE values close to zero, albeit with under-confident uncertainties in the second row. In (b), Kermut is also relatively well-calibrated, as indicated by the increasing curves, albeit with large variances along both axes. The low coefficients of variation (c_v) indicate similar predictive variances in each setting. Overall, Kermut achieves good calibration in most cases as a result of the designed kernel.
Data: 

Figure fig_5: 1
Type: figure
Caption: Figure I.1: Histogram over normalized assay values for 51/69 datasets with multi-mutants. All datasets with more than 7500 variants are ignored. The histograms are colored according to the number of mutations per variant. The assay distributions belong to different modalities depending on the number of mutations present, where double mutations commonly lead to a loss of fitness.
Data: 

Figure fig_6: 224
Type: figure
Caption: Figure J.2: Boxplot of intercepts and slopes of error-based calibration curves for Kermut and a baseline GP with the sequence kernel on ESM-2 embeddings. Perfect calibration has an intercept of zero and a slope of one (indicated by dashed lines). The baseline GP has poor calibration compared to Kermut. Horizontal lines indicate the 0.9, 0.75, 0.5, 0.25, and 0.1 quantiles.
Data: 

Figure fig_7: 123
Type: figure
Caption: Figure L.1: Calibration curves for ProteinNPT on the four datasets from Table 3. Standard deviation over CV folds is shown as vertical bars. Perfect calibration corresponds to diagonal lines (y = x) and is shown as dashed lines in each plot. The predictive uncertainties for ProteinNPT are very small, resulting in poor out-of-the-box calibration as seen on the x-axis in (b) and in Figures L.2 to L.5. However, as indicated by the trend in both (a) and (b), the errors correlate with the magnitude of the uncertainties. This suggests that a recalibration might be sufficient to achieve good calibration.
Data: 

Figure fig_8: 4
Type: figure
Caption: Figure L.4: Predicted means by ProteinNPT (±2σ) vs. true values. Columns correspond to CV schemes. Rows correspond to test folds. Perfect prediction corresponds to the dashed diagonal line (x = y). While the predictions are good, the model is very overconfident. Despite the relatively poor predictions, the model remains overconfident.
Data: 

Figure fig_9: 52
Type: figure
Caption: Figure N.2: Distributions of Kermut's hyperparameters across ProteinGym assays and splits, visualized against dataset sizes. The inverses of γ_1, γ_2, and γ_3 are shown to facilitate comparison with the sequence kernel's lengthscale.
Data: 

Figure fig_10: 4
Type: figure
Caption: Figure N.4: The structure kernel scale hyperparameter, λ, is shown against the kernel balancing hyperparameter, π.
Data: 

Figure tab_0: 1
Type: table
Caption: Performance on the ProteinGym benchmark. Best results are bold. Kermut reaches superior performance across splits with significant gains in the challenging modulo and contiguous settings. OHE and NPT model types correspond to one-hot encodings and non-parametric transformers.
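Data:
| Model type | Model name | Spearman (↑) Contig. | Spearman (↑) Mod. | Spearman (↑) Rand. | Spearman (↑) Avg. | MSE (↓) Contig. | MSE (↓) Mod. | MSE (↓) Rand. | MSE (↓) Avg. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| OHE | None | 0.064 | 0.027 | 0.579 | 0.224 | 1.158 | 1.125 | 0.898 | 1.061 |
| OHE | ESM-1v | 0.367 | 0.368 | 0.514 | 0.417 | 0.977 | 0.949 | 0.764 | 0.897 |
| OHE | DeepSequence | 0.400 | 0.400 | 0.521 | 0.440 | 0.967 | 0.940 | 0.767 | 0.891 |
| OHE | MSAT | 0.410 | 0.412 | 0.536 | 0.453 | 0.963 | 0.934 | 0.749 | 0.882 |
| OHE | TranceptEVE | 0.441 | 0.440 | 0.550 | 0.477 | 0.953 | 0.914 | 0.743 | 0.870 |
| Embed. | ESM-1v | 0.481 | 0.506 | 0.639 | 0.542 | 0.937 | 0.861 | 0.563 | 0.787 |
| Embed. | MSAT | 0.525 | 0.538 | 0.642 | 0.568 | 0.836 | 0.795 | 0.573 | 0.735 |
| Embed. | Tranception | 0.490 | 0.526 | 0.696 | 0.571 | 0.972 | 0.833 | 0.503 | 0.769 |
| NPT | ProteinNPT | 0.547 | 0.564 | 0.730 | 0.613 | 0.820 | 0.771 | 0.459 | 0.683 |
| GP | Kermut | 0.610 | 0.633 | 0.744 | 0.662 | 0.699 | 0.652 | 0.414 | 0.589 |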

Figure tab_1: 2
Type: table
Caption: Ablation results. Key components of the kernel are removed and the model is trained and evaluated on 174/217 assays from the ProteinGym benchmark. The ablation column shows the alteration to the GP formulation. The metrics are subtracted from Kermut to show the change in performance. Negative ∆Spearman values indicate a drop in performance.
Data: 

Figure tab_3: 3
Type: table
Caption: Details and results for four diverse ProteinGym datasets used for calibration analysis. The results show the Spearman correlation for each CV-scheme and the average correlation.
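Data:
| UniProt ID | Spearman (↑) Contig. | Spearman (↑) Mod. | Spearman (↑) Rand. | Spearman (↑) Avg. | N | L | Assay | Source |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| BLAT_ECOLX | 0.804 | 0.826 | 0.909 | 0.846 | 4996 | 286 | Organismal fitness | [69] |
| PA_I34A1 | 0.226 | 0.457 | 0.539 | 0.407 | 1820 | 716 | Organismal fitness | [70] |
| TCRG1_MOUSE | 0.849 | 0.849 | 0.928 | 0.875 | 621 | 37 | Stability | [4] |
| OPSD_HUMAN | 0.739 | 0.734 | 0.727 | 0.734 | 165 | 348 | Expression | [71] |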

Figure tab_5: C
Type: table
Caption: Table C.1: Approximate wall clock times for training and evaluating Kermut and ProteinNPT for a single split scheme, i.e., by using 5-fold cross validation. While the runtime of Kermut scales with dataset size, ProteinNPT appears to scale more strongly with sequence length due to the tri-axial attention.
Data:
| Dataset | Kermut runtime | PNPT runtime | N | L |
| --- | --- | --- | --- | --- |
| BLAT_ECOLX | 111s | ≈ 32h | 4996 | 286 |
| PA_I34A1 | 45s | ≈ 52h | 1820 | 716 |
| TCRG1_MOUSE | 19s | ≈ 22h | 621 | 37 |
| OPSD_HUMAN | 14s | ≈ 40h | 165 | 348 |

Figure tab_6: E
Type: table
Caption: Table E.1: Aggregated Spearman results on the ProteinGym substitution benchmark. Performance is shown per cross-validation scheme. Kermut reaches superior performance across the board. The fifth data column shows the non-parametric bootstrap standard error of the difference between the Spearman performance for each model and Kermut, computed over 10,000 bootstrap samples from the set of proteins in the ProteinGym substitution benchmark.
Data:
| Model name | Spearman (↑) Cont. | Spearman (↑) Mod. | Spearman (↑) Rand. | Spearman (↑) Avg. | Std. err. |
| --- | --- | --- | --- | --- | --- |
| Kermut | 0.610 | 0.633 | 0.744 | 0.662 | 0.000 |
| ProteinNPT | 0.547 | 0.564 | 0.730 | 0.613 | 0.009 |
| Tranception Emb. | 0.490 | 0.526 | 0.696 | 0.571 | 0.008 |
| MSAT Emb. | 0.525 | 0.538 | 0.642 | 0.568 | 0.013 |
| ESM-1v Emb. | 0.481 | 0.506 | 0.639 | 0.542 | 0.011 |
| TranceptEVE + OHE | 0.441 | 0.440 | 0.550 | 0.477 | 0.012 |
| Tranception + OHE | 0.419 | 0.419 | 0.535 | 0.458 | 0.012 |
| MSAT + OHE | 0.410 | 0.412 | 0.536 | 0.453 | 0.014 |
| DeepSequence + OHE | 0.400 | 0.400 | 0.521 | 0.440 | 0.016 |
| ESM-1v + OHE | 0.367 | 0.368 | 0.514 | 0.417 | 0.014 |
| OHE | 0.064 | 0.027 | 0.579 | 0.224 | 0.014 |
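The bootstrap standard error described in the caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' evaluation code: it assumes one aggregated score per protein for each model, resamples proteins with replacement, and reports the standard deviation of the resampled mean differences. The function name `bootstrap_se_of_difference` and the toy scores are hypothetical.

```python
import random
import statistics


def bootstrap_se_of_difference(scores_model, scores_ref, num_samples=10_000, seed=0):
    """Bootstrap standard error of the mean per-protein score difference.

    Assumption: scores_model[i] and scores_ref[i] are scores for the same protein.
    """
    rng = random.Random(seed)
    diffs = [m - r for m, r in zip(scores_model, scores_ref)]
    n = len(diffs)
    resampled_means = []
    for _ in range(num_samples):
        # Resample proteins (not individual variants) with replacement.
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        resampled_means.append(sum(sample) / n)
    return statistics.stdev(resampled_means)


# Hypothetical per-protein Spearman scores for a reference model and a baseline.
ref_scores = [0.70, 0.55, 0.81, 0.62, 0.48]
baseline_scores = [0.61, 0.57, 0.70, 0.50, 0.45]

se_baseline = bootstrap_se_of_difference(baseline_scores, ref_scores, num_samples=2000)
se_self = bootstrap_se_of_difference(ref_scores, ref_scores, num_samples=2000)
```

Comparing the reference model with itself gives differences that are all zero, so the standard error is exactly zero, which matches the 0.000 entry in the Kermut row.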

Figure tab_7: E
Type: table
Caption: Table E.2: Aggregated Spearman results on the ProteinGym substitution benchmark. Performance is shown per functional category. Kermut reaches superior performance across the board.
Data:
| Model name | Activity | Binding | Expression | Fitness | Stability |
| --- | --- | --- | --- | --- | --- |
| Kermut | 0.606 | 0.630 | 0.672 | 0.581 | 0.824 |
| ProteinNPT | 0.577 | 0.536 | 0.637 | 0.545 | 0.772 |
| Tranception Emb. | 0.520 | 0.529 | 0.613 | 0.519 | 0.674 |
| MSAT Emb. | 0.547 | 0.470 | 0.584 | 0.493 | 0.749 |
| ESM-1v Emb. | 0.487 | 0.450 | 0.587 | 0.468 | 0.717 |
| TranceptEVE + OHE | 0.502 | 0.444 | 0.476 | 0.470 | 0.493 |
| Tranception + OHE | 0.475 | 0.416 | 0.476 | 0.448 | 0.473 |
| MSAT + OHE | 0.480 | 0.393 | 0.463 | 0.437 | 0.491 |
| DeepSequence + OHE | 0.467 | 0.418 | 0.424 | 0.422 | 0.471 |
| ESM-1v + OHE | 0.421 | 0.363 | 0.452 | 0.383 | 0.463 |
| OHE | 0.213 | 0.212 | 0.226 | 0.194 | 0.273 |

Figure tab_8: E
Type: table
Caption: Table E.3: Aggregated MSE results on the ProteinGym substitution benchmark. Performance is shown per cross-validation scheme. Kermut reaches superior performance across the board. The fifth data column shows the non-parametric bootstrap standard error of the difference between the MSE performance for each model and Kermut, computed over 10,000 bootstrap samples from the set of proteins in the ProteinGym substitution benchmark.
Data:
| Model name | MSE (↓) Cont. | MSE (↓) Mod. | MSE (↓) Rand. | MSE (↓) Avg. | Std. err. |
| --- | --- | --- | --- | --- | --- |
| Kermut | 0.699 | 0.652 | 0.414 | 0.589 | 0.000 |
| ProteinNPT | 0.820 | 0.771 | 0.459 | 0.683 | 0.017 |
| MSAT Emb. | 0.836 | 0.795 | 0.573 | 0.735 | 0.021 |
| Tranception Emb. | 0.972 | 0.833 | 0.503 | 0.769 | 0.023 |
| ESM-1v Emb. | 0.937 | 0.861 | 0.563 | 0.787 | 0.030 |
| TranceptEVE + OHE | 0.953 | 0.914 | 0.743 | 0.870 | 0.019 |
| MSAT + OHE | 0.963 | 0.934 | 0.749 | 0.882 | 0.020 |
| DeepSequence + OHE | 0.967 | 0.940 | 0.767 | 0.891 | 0.017 |
| Tranception + OHE | 0.985 | 0.934 | 0.766 | 0.895 | 0.022 |
| ESM-1v + OHE | 0.977 | 0.949 | 0.764 | 0.897 | 0.013 |
| OHE | 1.158 | 1.125 | 0.898 | 1.061 | 0.017 |

Figure tab_9: E
Type: table
Caption: Table E.4: Aggregated MSE results on the ProteinGym substitution benchmark. Performance is shown per functional category. Kermut reaches superior performance across the board.
Data:
| Model name | Activity | Binding | Expression | Fitness | Stability |
| --- | --- | --- | --- | --- | --- |
| Kermut | 0.630 | 0.843 | 0.523 | 0.657 | 0.289 |
| ProteinNPT | 0.703 | 1.016 | 0.578 | 0.752 | 0.368 |
| MSAT Emb. | 0.728 | 1.092 | 0.660 | 0.789 | 0.405 |
| Tranception Emb. | 0.814 | 1.080 | 0.639 | 0.788 | 0.525 |
| ESM-1v Emb. | 0.799 | 1.231 | 0.655 | 0.792 | 0.456 |
| TranceptEVE + OHE | 0.793 | 1.199 | 0.780 | 0.825 | 0.756 |
| MSAT + OHE | 0.810 | 1.221 | 0.788 | 0.836 | 0.756 |
| DeepSequence + OHE | 0.830 | 1.140 | 0.832 | 0.860 | 0.793 |
| Tranception + OHE | 0.831 | 1.246 | 0.787 | 0.845 | 0.765 |
| ESM-1v + OHE | 0.843 | 1.192 | 0.795 | 0.870 | 0.783 |
| OHE | 1.022 | 1.306 | 0.986 | 1.040 | 0.949 |

Figure tab_10: E
Type: table
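Caption: Table E.5: Performance on the ProteinGym benchmark using the corrected splits.
Data:
| Model type | Model name | Spearman (↑) Contig. | Spearman (↑) Mod. | Spearman (↑) Rand. | Spearman (↑) Avg. | MSE (↓) Contig. | MSE (↓) Mod. | MSE (↓) Rand. | MSE (↓) Avg. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| GP | Kermut | 0.591 | 0.631 | 0.744 | 0.655 | 0.739 | 0.680 | 0.141 | 0.611 |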

Figure tab_11: F
Type: table
Caption: Table F.3: Ablation results. Key components of the kernel are removed or altered and the model is trained and evaluated on 174/217 assays from the ProteinGym benchmark. The ablation column shows the alteration to the kernel formulation.
Data:
| Description | Ablation | MSE (↓) Contig. | MSE (↓) Mod. | MSE (↓) Rand. | MSE (↓) Avg. | Std. err. |
| --- | --- | --- | --- | --- | --- | --- |
| No structure kernel | k_struct = 0 | 0.825 | 0.769 | 0.460 | 0.684 | 0.012 |
| No sequence kernel | k_seq = 0 | 0.791 | 0.744 | 0.492 | 0.676 | 0.008 |
| No inter-residue dist. | k_d = 1 | 0.789 | 0.743 | 0.426 | 0.652 | 0.006 |
| No mut. prob./site comp. | k_p = k_H = 1 | 0.775 | 0.722 | 0.436 | 0.644 | 0.007 |
| Const. mean | m(x) = α | 0.770 | 0.718 | 0.429 | 0.639 | 0.005 |
| No mut. prob. | k_p = 1 | 0.761 | 0.704 | 0.436 | 0.634 | 0.006 |
| No site comp. | k_H = 1 | 0.735 | 0.691 | 0.421 | 0.616 | 0.002 |
| Kermut (SE in k_H) | k_H = k_SE | 0.738 | 0.690 | 0.416 | 0.614 | 0.003 |
| Kermut (JSD in k_H) | k_H = k_JSD | 0.731 | 0.684 | 0.415 | 0.610 | 0.002 |
| Kermut (Matérn in k_seq) | k_seq = k_Matérn5/2 | 0.724 | 0.678 | 0.414 | 0.605 | 0.002 |
| Kermut (product) | k = k_struct × k_seq | 0.726 | 0.678 | 0.411 | 0.605 | 0.004 |
| Kermut | | 0.730 | 0.683 | 0.420 | 0.611 | 0.000 |

Figure tab_12: F
Type: table
Caption: Table F.4: Ablation results. Key components of the kernel are removed or altered and the model is trained and evaluated on 174/217 assays from the ProteinGym benchmark. The ablation column shows the alteration to the kernel formulation. Performance is shown per functional category.
Data:
| Model name | Ablation | Spearman (↑) Activity | Spearman (↑) Binding | Spearman (↑) Expression | Spearman (↑) Fitness | Spearman (↑) Stability |
| --- | --- | --- | --- | --- | --- | --- |
| Const. mean | m(x) = α | 0.579 | 0.580 | 0.651 | 0.536 | 0.821 |
| No site comp. | k_H = 1 | 0.590 | 0.619 | 0.664 | 0.574 | 0.820 |
| No mut. prob. | k_p = 1 | 0.590 | 0.611 | 0.651 | 0.564 | 0.801 |
| No mut. prob./site comp. | k_p = k_H = 1 | 0.569 | 0.601 | 0.648 | 0.557 | 0.790 |
| No inter-residue dist. | k_d = 1 | 0.562 | 0.560 | 0.634 | 0.546 | 0.818 |
| No sequence kernel | k_seq = 0 | 0.578 | 0.616 | 0.611 | 0.547 | 0.716 |
| No structure kernel | k_struct = 0 | 0.531 | 0.529 | 0.614 | 0.519 | 0.778 |
| Kermut (Matérn in k_seq) | k_seq = k_Matérn5/2 | 0.604 | 0.625 | 0.667 | 0.581 | 0.828 |
| Kermut (SE in k_H) | k_H = k_SE | 0.593 | 0.623 | 0.664 | 0.576 | 0.818 |
| Kermut (JSD in k_H) | k_H = k_JSD | 0.599 | 0.628 | 0.664 | 0.578 | 0.823 |
| Kermut (product) | k = k_struct × k_seq | 0.599 | 0.632 | 0.674 | 0.578 | 0.829 |
| Kermut | | 0.602 | 0.625 | 0.665 | 0.578 | 0.824 |

Figure tab_13: F
Type: table
Caption: Table F.5: Ablation results. Key components of the kernel are removed or altered and the model is trained and evaluated on 174/217 assays from the ProteinGym benchmark. The ablation column shows the alteration to the kernel formulation. Performance is shown per functional category.
Data:
| Model name | Ablation | MSE (↓) Activity | MSE (↓) Binding | MSE (↓) Expression | MSE (↓) Fitness | MSE (↓) Stability |
| --- | --- | --- | --- | --- | --- | --- |
| Const. mean | m(x) = α | 0.672 | 0.932 | 0.575 | 0.710 | 0.304 |
| No site comp. | k_H = 1 | 0.651 | 0.906 | 0.557 | 0.669 | 0.295 |
| No mut. prob. | k_p = 1 | 0.651 | 0.923 | 0.584 | 0.682 | 0.327 |
| No mut. prob./site comp. | k_p = k_H = 1 | 0.672 | 0.933 | 0.586 | 0.688 | 0.343 |
| No inter-residue dist. | k_d = 1 | 0.696 | 0.965 | 0.599 | 0.702 | 0.301 |
| No sequence kernel | k_seq = 0 | 0.673 | 0.926 | 0.622 | 0.718 | 0.440 |
| No structure kernel | k_struct = 0 | 0.712 | 0.993 | 0.626 | 0.734 | 0.358 |
| Kermut (Matérn in k_seq) | k_seq = k_Matérn5/2 | 0.636 | 0.891 | 0.555 | 0.662 | 0.283 |
| Kermut (SE in k_H) | k_H = k_SE | 0.648 | 0.899 | 0.558 | 0.669 | 0.299 |
| Kermut (JSD in k_H) | k_H = k_JSD | 0.642 | 0.893 | 0.558 | 0.667 | 0.291 |
| Kermut (product) | k = k_struct × k_seq | 0.643 | 0.884 | 0.548 | 0.667 | 0.283 |
| Kermut | | 0.640 | 0.903 | 0.558 | 0.665 | 0.289 |

Figure tab_14: G
Type: table
Caption: Table G.1: Results in the multi-mutant setting. Each row corresponds to a different setting of training and evaluation domain. The third row corresponds to the fold_rand_multiples split scheme from ProteinGym. Experiments are carried out on 51 datasets, corresponding to all datasets with multiple mutants and fewer than 7500 variants in total, with the exception of GCN4_YEAST_Staller_2018, which has been removed due to its high mutation count.
Data:
| Domain | Spearman (↑) Kermut | Spearman (↑) Kermut* | Spearman (↑) k_seq | MSE (↓) Kermut | MSE (↓) Kermut* | MSE (↓) k_seq |
| --- | --- | --- | --- | --- | --- | --- |
| 1M/2M→1M | 0.910 | 0.908 | 0.879 | 0.139 | 0.143 | 1.154 |
| 1M/2M→2M | 0.895 | 0.895 | 0.873 | 0.103 | 0.103 | 1.044 |
| 1M/2M→1M/2M | 0.938 | 0.937 | 0.913 | 0.116 | 0.118 | 1.092 |
| 1M→2M | 0.650 | 0.660 | 0.648 | 0.805 | 0.580 | 0.506 |

Figure tab_15: H
Type: table
Caption: Table H.1: Performance on FLIP's GB1 landscape. *: Reference results from FLIP. The best and second-best scores per split have been highlighted.
Data:
| Model | 1-vs-rest | 2-vs-rest | 3-vs-rest | low-vs-high |
| --- | --- | --- | --- | --- |
| ESM-1b (per AA)* | 0.28 | 0.55 | 0.79 | 0.59 |
| ESM-1b (mean)* | 0.32 | 0.36 | 0.54 | 0.13 |
| ESM-1b (mut mean)* | -0.08 | 0.19 | 0.49 | 0.45 |
| ESM-1v (per AA)* | 0.28 | 0.28 | 0.82 | 0.51 |
| ESM-1v (mean)* | 0.32 | 0.32 | 0.77 | 0.10 |
| ESM-1v (mut mean)* | 0.19 | 0.19 | 0.80 | 0.49 |
| ESM-untrained (per AA)* | 0.06 | 0.06 | 0.48 | 0.23 |
| ESM-untrained (mean)* | 0.05 | 0.05 | 0.46 | 0.10 |
| ESM-untrained (mut mean)* | 0.21 | 0.21 | 0.57 | 0.13 |
| Ridge* | 0.28 | 0.59 | 0.76 | 0.34 |
| CNN* | 0.17 | 0.32 | 0.83 | 0.51 |
| Levenshtein* | 0.17 | 0.16 | -0.04 | -0.10 |
| BLOSUM62* | 0.15 | 0.14 | 0.01 | -0.13 |
| Kermut | -0.14 | 0.52 | 0.77 | 0.35 |
| Kermut (constant mean) | 0.37 | 0.55 | 0.77 | 0.36 |
| Baseline GP | 0.40 | 0.57 | 0.73 | 0.42 |

I Histogram over assays for multi-mutant datasets
A histogram over normalized assay values for 51/69 datasets with multi-mutants (total fewer than 7500 sequences) can be seen in Figure I.1. The histograms are colored according to the number of mutations per variant. The assay distributions belong to different modalities depending on the number of mutations present, where double mutations often lead to a loss of fitness.

Figure tab_16: 
Type: table
Caption: Figure J.1: Calibration metrics per domain for Kermut and the sequence kernel on ESM-2 embeddings. Random, modulo, and contiguous domains are from the ProteinGym substitution benchmark. Multiples corresponds to training and testing on both single and double mutants. Extrapolation corresponds to training on singles and predicting on doubles. For comparability, the 51 datasets with multi-mutants were used for the figure in all domains. The performance results for the multi-mutant setting can be found in Table G.1. Error bars correspond to standard error.
Data: [Plot: three panels showing ECE, ENCE, and c_v; x-axis: Random, Modulo, Contiguous, Multiples, Extrapolation; legend: Kermut, Baseline GP]

Figure tab_18: M
Type: table
Caption: Table M.1: Performance using alternate zero-shot methods. The experiments are carried out on 174/217 datasets. *: ESM-2 in this table is equivalent to Kermut from the main results in Table 1.
Data:
| Zero-shot predictor | Spearman (↑) Contig. | Spearman (↑) Mod. | Spearman (↑) Rand. | Spearman (↑) Avg. | MSE (↓) Contig. | MSE (↓) Mod. | MSE (↓) Rand. | MSE (↓) Avg. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| EVE | 0.608 | 0.627 | 0.750 | 0.662 | 0.731 | 0.682 | 0.412 | 0.608 |
| ESM-2* | 0.605 | 0.628 | 0.743 | 0.659 | 0.730 | 0.683 | 0.420 | 0.611 |
| GEMME | 0.605 | 0.622 | 0.744 | 0.657 | 0.728 | 0.682 | 0.416 | 0.609 |
| VESPA | 0.606 | 0.623 | 0.742 | 0.657 | 0.737 | 0.698 | 0.424 | 0.620 |
| TranceptEVE L | 0.600 | 0.619 | 0.744 | 0.654 | 0.741 | 0.693 | 0.420 | 0.618 |
| ESM-IF | 0.583 | 0.606 | 0.738 | 0.642 | 0.757 | 0.708 | 0.424 | 0.630 |
| ProteinMPNN | 0.575 | 0.599 | 0.734 | 0.636 | 0.769 | 0.718 | 0.429 | 0.639 |
| Constant mean | 0.569 | 0.596 | 0.735 | 0.633 | 0.770 | 0.718 | 0.429 | 0.639 |

N.2 Hyperparameters vs. dataset size


Formulas:
Formula formula_0: $k_H(x, x') = \exp\left(-\gamma_1 d_H(f_{\mathrm{IF}}(x), f_{\mathrm{IF}}(x'))\right)$, with $\gamma_1 > 0$ [66]

Formula formula_1: $k_p(x, x') = k_{\exp}(f_{\mathrm{IF}_1}(x), f_{\mathrm{IF}_1}(x')) = \exp\left(-\gamma_2 \lVert f_{\mathrm{IF}_1}(x) - f_{\mathrm{IF}_1}(x') \rVert\right)$

Formula formula_2: $k^1_{\mathrm{struct}}(x, x') = \lambda\, k_H(x, x')\, k_p(x, x')\, k_d(x, x')$, (1)

Formula formula_3: $k_{\mathrm{struct}}(x, x') = \sum_{i \in M} \sum_{j \in M'} k^1_{\mathrm{struct}}(x_i, x'_j)$ (2)

Formula formula_4: $k_{\mathrm{seq}}(x, x') = k_{\mathrm{SE}}(f_1(x), f_1(x')) = k_{\mathrm{SE}}(z, z') = \exp\left(-\frac{\lVert z - z' \rVert_2^2}{2\sigma^2}\right)$. (3)

Formula formula_5: $k(x, x') = \pi\, k_{\mathrm{struct}}(x, x') + (1 - \pi)\, k_{\mathrm{seq}}(x, x')$. (4)
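
Equations (2) to (4) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the single-mutant kernel of Eq. (1) is passed in as a callable, and `toy_k1`, the dictionary layout of a variant, and the mutation labels are all hypothetical.

```python
import math


def k_se(z, z_prime, sigma=1.0):
    """Squared-exponential kernel on embedding vectors, as in Eq. (3)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(z, z_prime))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def k_struct(mutations, mutations_prime, k1_struct):
    """Eq. (2): sum the single-mutant kernel over all pairs of mutations."""
    return sum(k1_struct(m, mp) for m in mutations for mp in mutations_prime)


def k_composite(x, x_prime, k1_struct, pi=0.5, sigma=1.0):
    """Eq. (4): weighted sum of the structure and sequence kernels."""
    ks = k_struct(x["mutations"], x_prime["mutations"], k1_struct)
    kq = k_se(x["embedding"], x_prime["embedding"], sigma)
    return pi * ks + (1.0 - pi) * kq


def toy_k1(m, m_prime):
    """Hypothetical stand-in for Eq. (1): 1 for identical mutations, else 0.5."""
    return 1.0 if m == m_prime else 0.5


# Hypothetical variants: a single mutant and a double mutant sharing one mutation.
single = {"mutations": ["A10G"], "embedding": [0.0, 1.0]}
double = {"mutations": ["A10G", "L25P"], "embedding": [0.0, 1.0]}

value = k_composite(single, double, toy_k1, pi=0.5)
```

With the toy kernel, the structure term sums to 1.0 + 0.5 = 1.5 and the sequence term equals 1.0 because the embeddings coincide, so the composite value is 0.5 × 1.5 + 0.5 × 1.0 = 1.25.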

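Formula formula_6: $d_H(p, q) = \frac{1}{\sqrt{2}} \sqrt{\sum_{i=1}^{20} \left(\sqrt{p_i} - \sqrt{q_i}\right)^2}$,

Formula formula_7: $k_H(x, x') = \exp\left(-\gamma_1 d_H(p, q)\right) = \exp\left(-\gamma_1 \frac{1}{\sqrt{2}} \sqrt{\sum_{i=1}^{20} \left(\sqrt{p_i} - \sqrt{q_i}\right)^2}\right)$.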
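
As a small worked example, the Hellinger kernel can be evaluated on two amino acid distributions. The sketch assumes the standard Hellinger distance, including the outer square root over the sum, and uses hypothetical 20-dimensional distributions in place of inverse-folding outputs.

```python
import math


def hellinger_distance(p, q):
    """Standard Hellinger distance between two discrete distributions."""
    total = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(total) / math.sqrt(2.0)


def k_hellinger(p, q, gamma1=1.0):
    """Exponential kernel on the Hellinger distance, with gamma1 > 0."""
    return math.exp(-gamma1 * hellinger_distance(p, q))


# Hypothetical site distributions over the 20 amino acids.
uniform = [1.0 / 20] * 20
site_a = [1.0] + [0.0] * 19        # all mass on the first amino acid
site_b = [0.0, 1.0] + [0.0] * 18   # all mass on the second amino acid

same = k_hellinger(uniform, uniform)     # identical distributions
disjoint = k_hellinger(site_a, site_b)   # non-overlapping distributions
```

Identical distributions have distance 0 and kernel value 1, while distributions with disjoint support have the maximal distance 1 and kernel value exp(-γ_1), so a larger γ_1 makes the kernel decay faster as site environments diverge.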

Formula formula_8: $\mathrm{Cov}(Y_1 + Y_2, Y_3 + Y_4) = \mathrm{Cov}(Y_1, Y_3) + \mathrm{Cov}(Y_1, Y_4) + \mathrm{Cov}(Y_2, Y_3) + \mathrm{Cov}(Y_2, Y_4)$

Formula formula_9: $f_0(x) = \sum_{i \in M} \log p(x_i) - \log p(x_i^{\mathrm{WT}})$

Formula formula_10: $k_{\mathcal{X}}(x, x') := k_{\mathcal{Z}}(f(x), f(x'))$ is a kernel on $\mathcal{X}$.

Formula formula_11: $k_{\mathrm{set}}(b, b') := \sum_{x_m \in b,\, x'_m \in b'} \lambda\, k^1_{\mathrm{struct}}(x_m, x'_m)$

Formula formula_12: $k_d(x, x') = 1$

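Formula formula_13: $\mathrm{ECE} = \frac{1}{K} \sum_{i=1}^{K} \lvert \mathrm{acc}(i) - i \rvert$,

Formula formula_14: $\mathrm{ENCE} = \frac{1}{K} \sum_{i=1}^{K} \frac{\lvert \mathrm{RMV}(i) - \mathrm{RMSE}(i) \rvert}{\mathrm{RMV}(i)}$.

Formula formula_15: $c_v = \frac{\sqrt{\frac{\sum_{n=1}^{N} (\sigma_n - \mu_\sigma)^2}{N - 1}}}{\mu_\sigma}$,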
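
The ENCE and c_v metrics can be sketched as follows. This is a minimal illustration, not the authors' evaluation code. It assumes the common error-based binning: predictions are sorted by predictive variance and split into K equal-size bins, with RMV(i) the root mean predicted variance and RMSE(i) the root mean squared error in bin i. It also assumes c_v is the sample standard deviation of the predictive standard deviations divided by their mean. Function names and toy inputs are hypothetical.

```python
import math


def ence(variances, squared_errors, num_bins):
    """Expected normalized calibration error over equal-size variance bins."""
    order = sorted(range(len(variances)), key=lambda n: variances[n])
    bin_size = len(order) // num_bins
    total = 0.0
    for i in range(num_bins):
        start = i * bin_size
        # The last bin absorbs any remainder.
        stop = len(order) if i == num_bins - 1 else start + bin_size
        idx = order[start:stop]
        rmv = math.sqrt(sum(variances[n] for n in idx) / len(idx))
        rmse = math.sqrt(sum(squared_errors[n] for n in idx) / len(idx))
        total += abs(rmv - rmse) / rmv
    return total / num_bins


def coefficient_of_variation(sigmas):
    """Sample standard deviation of predictive std. devs. divided by their mean."""
    n = len(sigmas)
    mu = sum(sigmas) / n
    std = math.sqrt(sum((s - mu) ** 2 for s in sigmas) / (n - 1))
    return std / mu


# Hypothetical predictions: variances match squared errors (calibrated) ...
calibrated = ence([1.0, 1.0, 4.0, 4.0], [1.0, 1.0, 4.0, 4.0], num_bins=2)
# ... versus errors twice as large as the predicted standard deviation.
overconfident = ence([1.0, 1.0, 4.0, 4.0], [4.0, 4.0, 16.0, 16.0], num_bins=2)
constant_cv = coefficient_of_variation([2.0, 2.0, 2.0])
```

A c_v of zero, as for the constant uncertainties above, means every prediction carries the same uncertainty. An ENCE near zero with a low c_v therefore indicates good calibration on average but says little about instance-specific calibration.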
