Keywords: COVID-19, SARS-CoV-2, spike protein, variants of concern, PhyloTransformer
TL;DR: A Transformer-based model of genetic mutations that may confer a viral reproductive advantage.
Abstract: In this article, we present PhyloTransformer, a Transformer-based self-supervised discriminative model of genetic mutations that may confer a viral reproductive advantage. We trained PhyloTransformer on 1,765,297 severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) sequences to infer fitness advantages by directly modeling mutations in the nucleic acid sequence. PhyloTransformer uses advanced techniques from natural language processing to enable efficient and accurate intra-sequence dependency modeling over the entire RNA sequence. We measured the accuracy with which our method and baseline models that take only local segments as input predict novel mutations and novel combinations. PhyloTransformer outperformed every baseline method with statistical significance. We also predicted the occurrence of mutations at each nucleotide of the receptor binding motif (RBM) and predicted modifications of N-glycosylation sites. We anticipate that the mutations predicted by PhyloTransformer may help identify potentially threatening mutations and guide the design of therapeutics and vaccines that effectively target future variants.
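
To make "modifications of N-glycosylation sites" concrete, the following is a minimal sketch, not PhyloTransformer itself. It assumes the standard N-linked glycosylation sequon N-X-S/T (X any residue except proline) and reports which sequons a substitution removes or creates. The function names and the toy sequences are hypothetical and chosen only for illustration.

```python
import re

# N-linked glycosylation sequon: Asn, any residue except Pro, then Ser or Thr.
# The lookahead makes matches zero-width, so overlapping sequons are all found.
SEQUON = re.compile(r"(?=N[^P][ST])")


def sequon_positions(protein: str) -> set[int]:
    """Return the 0-based start indices of N-X-S/T sequons (X != Pro)."""
    return {m.start() for m in SEQUON.finditer(protein)}


def sequon_changes(reference: str, mutant: str) -> tuple[set[int], set[int]]:
    """Return (lost, gained) sequon start positions for a substitution-only mutant."""
    if len(reference) != len(mutant):
        raise ValueError("sequences must have equal length (substitutions only)")
    ref = sequon_positions(reference)
    mut = sequon_positions(mutant)
    return ref - mut, mut - ref


# Toy amino acid sequences (hypothetical, not real spike protein fragments).
reference = "AANGTKKNPSAA"
mutant = "AAKGTKKNASAA"  # N3K removes one sequon; P9A creates another
lost, gained = sequon_changes(reference, mutant)
print("lost:", sorted(lost), "gained:", sorted(gained))
```

A model that predicts nucleotide mutations would first need its predictions translated to amino acid sequences before a check like this applies; that translation step is omitted here.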