What Survives Privatization? A Guide to Structure and Utility in Differentially Private Genome-Wide Association Studies

TMLR Paper8094 Authors

25 Mar 2026 (modified: 26 Mar 2026) | Under review for TMLR | CC BY 4.0
Abstract: Single nucleotide polymorphisms (SNPs) are among the most common and informative forms of genetic variation in the human genome and constitute the primary data representation used in genome-wide association studies (GWAS). Due to their extreme dimensionality, strong correlation structure, and the presence of both population-level and familial dependencies, SNP datasets exhibit structural properties that fundamentally distinguish them from standard tabular data. At the same time, genomic data is uniquely sensitive; it is immutable, identifying, and shared across relatives, and has been shown to be vulnerable to a wide range of attacks, including membership inference, reconstruction, and kinship inference. As a result, protecting SNP data has become a critical and practically unavoidable requirement. Differential privacy (DP) provides a rigorous mathematical framework for protecting sensitive data under strong adversarial assumptions. However, in the context of GWAS, the design and evaluation of meaningful DP mechanisms crucially depend on understanding the biological, statistical, and structural properties of SNP data and the downstream analysis pipelines. For a typical privacy researcher, acquiring even the minimal domain knowledge required to reason correctly about the structure of genomic data and the associated analysis pipelines represents a substantial and time-consuming barrier. Yet, without this understanding, progress in private genomic data analysis risks being misguided or misleading. This survey explicitly bridges this gap. We provide a structured, self-contained primer on the structural properties of SNP data and the core analytical workflows of GWAS, focusing on the aspects most consequential for privacy definitions, mechanism design, and utility. Building on this foundation, we present the first comprehensive and systematic overview of differentially private methods for SNP datasets. 
We organize the literature through a release-oriented taxonomy that reframes existing approaches in terms of what survives privatization, revealing the design choices and trade-offs that shape their scientific and practical utility. Finally, we identify key open challenges arising from mismatches between existing differential privacy methodologies and the scientific, statistical, and operational realities of genomic data analysis, and outline future research directions toward principled and deployable privacy-preserving GWAS.
Submission Type: Long submission (more than 12 pages of main content)
Assigned Action Editor: ~Amartya_Sanyal1
Submission Number: 8094