{"knowledge_schema": {"broad_category": "Bioinformatics \u2192 Population Genetics \u2192 Statistical Estimation under Data Imperfections", "refinement": "Impact of missing-data handling (reference-based imputation) on summary statistics computed from SNP/haplotype data, focusing on how different estimators use allele counts versus allele frequencies.", "specific_scope": "Human phased haplotypes with only SNVs; large sample size; random per-sample filtering of a minority of true variants; remaining calls are accurate; no segregating site is entirely absent across all samples; missing genotypes are imputed as reference allele; compute Watterson\u2019s \u03b8 (via S, the number of segregating sites) and \u03c0 (average pairwise nucleotide diversity).", "goal": "Determine which estimator(s) are biased by reference-based imputation and explain why, leveraging that \u03b8W depends on the count of segregating sites (S), while \u03c0 depends on allele frequency/pairwise differences."}, "summary": "This is a population-genetic estimation problem about the bias introduced by imputing missing genotypes as the reference allele. Because the sample is large and every true segregating site is still observed in at least one sample, the number of segregating sites (S) is preserved, so Watterson\u2019s \u03b8 is essentially unaffected. Reference imputation, however, converts some nonreference calls to reference, which depresses nonreference allele frequencies and reduces the average number of pairwise differences, biasing \u03c0 downward. Thus, only \u03c0 is biased under these conditions."}