GPC: Deep generative model of genetic variation data improves imputation accuracy in private populations
Track: Full / long paper (5-8 pages)
Keywords: Deep Learning, Machine Learning, Generative AI, Synthetic Data, Imputation, Population Genetics, Haplotype
TL;DR: We introduce Genetic Probabilistic Circuits (GPC), a tractable deep generative model that generates realistic artificial genomes and enables direct genotype imputation with improved accuracy for underrepresented populations.
Abstract: Artificial genomes (AGs) are increasingly used to benchmark genomic pipelines, test population genetic hypotheses, and construct reference panels for genotype imputation, while avoiding the restrictions associated with sharing real genomes. However, existing approaches often struggle to achieve realism, computational efficiency, and privacy preservation at the same time. We introduce Genetic Probabilistic Circuits (GPC), a deep generative model for genetic variation data based on hidden Chow--Liu trees represented as probabilistic circuits. GPC captures long-range dependencies among SNPs and is simple to train. We evaluate GPC across multiple ancestries in two large-scale datasets, the 1000 Genomes Project and UK Biobank. GPC matches or exceeds prior methods in generating AGs that resemble real genomes, and its AGs retain the population structure of the training genomes. AGs from GPC also reproduce patterns of linkage disequilibrium (LD; correlations between nearby genetic variants) more faithfully across length scales. When its AGs are used for imputation, GPC consistently improves accuracy by 3--33\% in $r^2$ over the next best generative model, with gains of 13--279\% for low-frequency variants (MAF $<$1\%). For underrepresented populations, GPC improves accuracy by 12--96\% over European-only reference panels. Finally, we demonstrate that GPC provides better privacy-utility tradeoffs than existing approaches, enabling accurate inference when sharing real genomes is restricted.
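For readers unfamiliar with the accuracy metric, the following is a minimal sketch of how a per-variant imputation $r^2$ is commonly computed: the squared Pearson correlation between true genotypes and imputed dosages. The function names `imputation_r2` and `relative_gain` are hypothetical, and reading the reported percentages as relative gains in $r^2$ is an assumption; the abstract does not specify the exact evaluation procedure.

```python
import numpy as np


def imputation_r2(true_genotypes, imputed_dosages):
    """Squared Pearson correlation for one variant across samples.

    true_genotypes: allele counts (0, 1, or 2) per sample.
    imputed_dosages: expected allele counts in [0, 2] per sample.
    Returns NaN when either vector has no variance (correlation undefined).
    """
    t = np.asarray(true_genotypes, dtype=float)
    d = np.asarray(imputed_dosages, dtype=float)
    if t.std() == 0.0 or d.std() == 0.0:
        return float("nan")
    r = np.corrcoef(t, d)[0, 1]
    return float(r * r)


def relative_gain(r2_new, r2_baseline):
    """Percent improvement of r2_new over r2_baseline (assumed reading)."""
    return 100.0 * (r2_new - r2_baseline) / r2_baseline


# Toy example with made-up values, not data from the paper.
truth = [0, 1, 2, 1, 0, 2]
dosage = [0.1, 0.9, 1.8, 1.2, 0.0, 1.7]
print(imputation_r2(truth, dosage))
```

In practice the per-variant values would typically be averaged within minor allele frequency bins before comparing reference panels.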
Anonymization: This submission has been anonymized for double-blind review via the removal of identifying information such as names, affiliations, and identifying URLs.
Submission Number: 26