Abstract: Deep learning approaches to genomic variant calling are increasingly reported in the literature, often with striking accuracy improvements claimed over classical pipelines. We examine the methodology underlying such claims through a four-pronged case study built around a single binary classifier on the Genome in a Bottle (GIAB) HG001 benchmark. An initial analysis of our own pipeline produced an apparently rigorous result: a synthetic-data $F_1$ of $0.994$ for Focal Loss versus $0.975$ for binary cross-entropy (BCE), together with a precision-dependent training-collapse pattern (collapse rates of 24\%/18\%/0\% across FP32/BF16/FP16) on real GIAB data over 50 random seeds. A subsequent detailed analysis tested each load-bearing component independently. We find that (i) the synthetic-to-real generalization gap is severe and inverts the loss-function ranking: on real data Focal Loss collapses to $F_1 = 0$ while BCE achieves $F_1 \in [0.27, 0.34]$; (ii) the proposed mechanism explaining the precision-collapse pattern (gradient-noise-as-implicit-regularization) fails under controlled testing with both round-to-nearest and stochastic rounding; (iii) the feature pipeline used in the initial analysis contains a structural label leak by construction; and (iv) the precision-collapse pattern itself does not survive faithful re-implementation: across 150 training runs (50 seeds $\times$ 3 precisions, 30 epochs each), zero collapses occur in any precision (Fisher exact $p < 10^{-3}$ versus the initial counts). Each individual finding has a plausible benign explanation; the contribution of this work is to show their conjunction in a methodology that appeared rigorous. We articulate four specific evaluation pitfalls implied by the case study and propose a minimal protocol to detect them prospectively.
Submission Type: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Polina_Kirichenko1
Submission Number: 8812