PRSformer: Disease Prediction from Million-Scale Individual Genotypes

Published: 18 Sept 2025 · Last Modified: 29 Oct 2025 · NeurIPS 2025 poster · CC BY 4.0
Keywords: Multitask Learning, Transformer Architectures, Neighborhood Attention, Deep Learning for Healthcare
TL;DR: PRSformer: A scalable Transformer using neighborhood attention for multitask disease prediction from million-scale individual genotypes, showing non-linear modeling benefits at large N.
Abstract: Predicting disease risk from DNA presents an unprecedented challenge as biobanks approach population scale ($N>10^6$ individuals) with ultra-high-dimensional features ($L>10^5$ genotypes). Current methods, often linear and reliant on summary statistics, fail to capture complex genetic interactions and discard valuable individual-level information. We introduce **PRSformer**, a scalable deep learning architecture designed for end-to-end, multitask disease prediction directly from million-scale individual genotypes. PRSformer employs neighborhood attention, achieving linear $O(L)$ complexity per layer and making Transformers tractable for genome-scale inputs. Crucially, PRSformer stacks these efficient attention layers, progressively increasing the effective receptive field to model local dependencies (e.g., within linkage disequilibrium blocks) before integrating information across wider genomic regions. This design, tailored for genomics, allows PRSformer to learn complex, potentially non-linear and long-range interactions directly from raw genotypes. We demonstrate PRSformer's effectiveness on a unique large private cohort ($N \approx 5$M), predicting 18 autoimmune and inflammatory conditions from $L \approx 140$k variants. PRSformer significantly outperforms highly optimized linear models trained on the *same individual-level data* and a state-of-the-art summary-statistics-based method (LDpred2) whose summary statistics are derived from the *same cohort*, quantifying the benefits of non-linear modeling and multitask learning at scale. Furthermore, experiments reveal that the advantage of non-linearity emerges primarily at large sample sizes ($N > 1$M) and that a multi-ancestry trained model improves generalization, establishing PRSformer as a new framework for deep learning in population-scale genomics.
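The abstract describes the core mechanism only in prose; the following is a minimal PyTorch sketch of what stacked 1D neighborhood attention looks like. It is an illustration under stated assumptions, not the authors' implementation: the class names (`NeighborhoodAttention1D`, `PRSformerBlock`), the pre-norm block layout, the zero-padded window edges, and all hyperparameters below are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NeighborhoodAttention1D(nn.Module):
    """Each position attends only to a centered window of neighbors,
    so a layer costs O(L * window) rather than O(L^2)."""
    def __init__(self, dim: int, window: int):
        super().__init__()
        assert window % 2 == 1, "odd window keeps the neighborhood centered"
        self.window = window
        self.scale = dim ** -0.5
        self.qkv = nn.Linear(dim, 3 * dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, L, dim)
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        pad = self.window // 2
        # Zero-pad the sequence ends (a simplification; real neighborhood
        # attention shifts the window at the boundaries instead), then
        # gather each position's window of keys/values: (B, L, window, dim).
        k = F.pad(k, (0, 0, pad, pad)).unfold(1, self.window, 1).permute(0, 1, 3, 2)
        v = F.pad(v, (0, 0, pad, pad)).unfold(1, self.window, 1).permute(0, 1, 3, 2)
        attn = torch.einsum("bld,blwd->blw", q * self.scale, k).softmax(dim=-1)
        out = torch.einsum("blw,blwd->bld", attn, v)
        return self.proj(out)

class PRSformerBlock(nn.Module):
    """Pre-norm Transformer block: neighborhood attention + MLP, both residual."""
    def __init__(self, dim: int, window: int):
        super().__init__()
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.attn = NeighborhoodAttention1D(dim, window)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x + self.attn(self.norm1(x))
        return x + self.mlp(self.norm2(x))

# Stacking grows the receptive field linearly with depth: after n layers of
# window w, a position can see roughly n * (w - 1) + 1 neighbors.
dim, window, n_layers = 64, 129, 8
model = nn.Sequential(*[PRSformerBlock(dim, window) for _ in range(n_layers)])
x = torch.randn(2, 1_000, dim)   # (batch, variants-as-tokens, embedding)
y = model(x)                     # (2, 1000, 64); receptive field ~ 8*128 + 1
```

With window $w$ and depth $n$, each layer costs $O(L \cdot w)$ and the effective receptive field after $n$ layers is roughly $n(w-1)+1$ positions, matching the abstract's picture of modeling local LD blocks first and integrating wider genomic regions as depth increases.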
Primary Area: Machine learning for sciences (e.g. climate, health, life sciences, physics, social sciences)
Submission Number: 17725