Training Flexible Models of Genetic Variant Effects from Functional Annotations using Accelerated Linear Algebra

Published: 01 May 2025, Last Modified: 18 Jun 2025, ICML 2025 poster, CC BY 4.0
TL;DR: We improve disease prediction on GWAS data by developing a methodology that allows us to fit flexible neural networks.
Abstract: To understand how genetic variants in human genomes manifest in phenotypes (traits like height or diseases like asthma), geneticists have sequenced and measured hundreds of thousands of individuals. Geneticists use this data to build models that predict how a genetic variant impacts phenotype given genomic features of the variant, like DNA accessibility or the presence of nearby DNA-bound proteins. As more data and features become available, one might expect predictive models to improve. Unfortunately, training these models is bottlenecked by expensive linear algebra: because variants in the genome are correlated with nearby variants, training requires inverting large matrices. Previous methods have therefore been restricted to fitting small models and to fitting simplified summary statistics rather than the full likelihood of the statistical model. In this paper, we leverage modern fast linear algebra techniques to develop DeepWAS (Deep genome Wide Association Studies), a method to train large and flexible neural network predictive models by optimizing the full likelihood. Surprisingly, we find that larger models only improve performance when using our full likelihood approach; when trained by fitting traditional summary statistics, larger models perform no better than small ones. We find that larger models trained on more features make better predictions, potentially improving disease predictions and therapeutic target identification.
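As a rough illustration of the linear algebra bottleneck mentioned in the abstract, the sketch below solves a system involving a correlation matrix without ever forming its inverse. The abstract does not say which solver DeepWAS uses; conjugate gradient is swapped in here as one standard iterative method. The AR(1) matrix `R`, the `noise` value, and all function and variable names are hypothetical stand-ins, not taken from the paper or its code.

```python
import numpy as np

def conjugate_gradient(matvec, b, tol=1e-10, max_iter=1000):
    """Solve A x = b for symmetric positive definite A, given only v -> A v."""
    x = np.zeros_like(b)
    r = b - matvec(x)          # initial residual
    p = r.copy()               # initial search direction
    rs_old = r @ r
    for _ in range(max_iter):
        if np.sqrt(rs_old) < tol:
            break              # residual norm is small enough
        Ap = matvec(p)
        alpha = rs_old / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return x

# Toy stand-in (assumption) for correlations between nearby variants:
# an AR(1) matrix whose entries decay with distance along the genome.
n = 200
idx = np.arange(n)
R = 0.5 ** np.abs(idx[:, None] - idx[None, :])
noise = 0.1  # hypothetical diagonal term that keeps the system well conditioned

def A_matvec(v):
    # Apply (R + noise * I) to v using matrix-vector products only.
    return R @ v + noise * v

rng = np.random.default_rng(0)
b = rng.standard_normal(n)

x_cg = conjugate_gradient(A_matvec, b)
# Dense direct solve, used here only to check the iterative answer.
x_direct = np.linalg.solve(R + noise * np.eye(n), b)
```

The iterative solver only needs matrix-vector products, so in a setting where `R` is too large to factor or invert, a fast routine for applying it can be passed as `matvec` instead.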
Lay Summary: Geneticists want to be able to predict what diseases someone is at risk of from the variants in their genome. They can do so by recording the disease status and genomes of hundreds of thousands of people to learn which variants are correlated with disease. Unfortunately, there are many more variants than study participants, so they can’t pinpoint exactly which variants cause disease. Luckily, many variants are known to lie in inactive regions of the genome, allowing us to ignore them and focus on the variants more likely to cause disease. In this paper we suggest we can do even better by building a more flexible neural network model that predicts how likely a variant is to contribute to disease based on its genomic region. We solve a few algorithmic challenges that previously made it very hard to train such a model. We build our flexible models and show that they predict disease better than previous, smaller models.
Application-Driven Machine Learning: This submission is on Application-Driven Machine Learning.
Link To Code: https://github.com/AlanNawzadAmin/DeepWAS
Primary Area: Probabilistic Methods->Bayesian Models and Methods
Keywords: GWAS, Transformers, Iterative Methods, Machine Learning, Numerical Linear Algebra
Flagged For Ethics Review: true
Submission Number: 4982