Alignment-based and protein foundation models for viral evolution, vaccines and vectors

Published: 13 Oct 2024, Last Modified: 01 Dec 2024AIDrugX SpotlightEveryoneRevisionsBibTeXCC BY 4.0
Keywords: protein language models, viral evolution, vaccines
TL;DR: The first comprehensive evaluation of mutation effect prediction for viruses from protein language models and alignment-based models.
Abstract: Protein mutation effect prediction has advanced several fields, from designing enzymes to forecasting viral evolution. These models are typically trained on sequence data, structural data, or a combination of both. Sequence-based models learn constraints governing protein structure and function from sequence data and fall broadly into two categories: alignment-based models and protein language models (PLMs). We provide the first detailed comparison of modeling approaches specifically for viruses. We curated a dataset of over 59 standardized viral deep mutational scanning assays and assessed the relative performance of three alignment-based models (PSSM, EVmutation, and EVE), three PLMs (ESM-1v, Tranception, and VESPA), and two versions of SaProt, a structurally-aware protein language model (SaProt-AF2 and SaProt-PDB). Interestingly, deeper alignments often led to worse performance for alignment-based models. On the other hand, PLMs tended to perform better as the size of the training database increased. Overall, alignment-based models outperformed sequence-only PLMs, while the best alignment based model performed on par with SaProt-PDB. Our findings suggest that modeling strategies that are effective for other taxa may not translate directly to viruses, likely due to the limited number of viral sequences used for training. However, incorporating additional virus-specific data into PLMs could enhance their predictive power for viral mutation effects, important for understanding viral evolution and the design of vaccines and viral vectors.
Submission Number: 17
Loading