DNA language models identify variants predictive across the human phenome

Published: 04 Mar 2024, Last Modified: 27 Apr 2024, Venue: MLGenX 2024 Poster, License: CC BY 4.0
Keywords: DNA language models, Foundation models, Polygenic risk models, Population genetics, Model application
TL;DR: DNA language models can be used to identify variants for disease onset prediction in a large population cohort.
Abstract: Early identification of individuals at high risk for diseases is crucial to public health, facilitating timely prevention and treatment strategies. Polygenic scores (PGS) offer significant clinical promise by estimating the genetic predisposition to diseases, yet their current impact is limited by insufficient power, especially for rare variants and diseases. While larger cohorts may enhance the power of PGS, advancements in methodology are equally critical. Recently, DNA language models, serving as foundation models for genomic data, have shown impressive capabilities in tasks such as predicting epigenetic marks, identifying regulatory sequences, and annotating variant effects. Yet their utility beyond local variant effects has not been explored to date. Here, we use the GPN-MSA and Nucleotide Transformer DNA language models to predict the relationship between genetic variants and disease risk. We use variant-level embeddings to predict the potential of variants to influence a wide range of phenotypes and show that variant sets with high scores are more predictive of diseases across the human phenome than baseline variant sets. Our work thus demonstrates the value of DNA language models in genome-wide variant selection, potentially complementing genome-wide association studies (GWAS) and polygenic scores by learning representations that can be used to identify rare variants with large effect sizes. Our results highlight the potential of DNA language models in identifying genotype-phenotype associations.
Submission Number: 55