Abstract: Whereas protein language models have demonstrated remarkable efficacy in predicting the
effects of missense variants, their DNA counterparts have not yet achieved a similar competitive
edge for genome-wide variant effect prediction, especially in complex genomes such as that of
humans. To address this challenge, we introduce GPN-MSA, a novel framework for DNA
language models that leverages whole-genome sequence alignments across multiple species and
takes only a few hours to train. Across several benchmarks on clinical databases (ClinVar,
COSMIC, and OMIM) and population genomic data (gnomAD), our model for the human
genome achieves outstanding performance on deleteriousness prediction for both coding and
non-coding variants.