Keywords: Genome Assembly, Geometric Deep Learning, Contrastive Learning, Clustering, Graph
TL;DR: We present the first deep learning tool for Hi-C-based phasing, and the first method to perform phasing at the unitig level.
Abstract: Accurate haplotype phasing is essential for high-quality genome assembly, yet de novo phasing of complex genomes without parental data remains a challenge. We formulate phasing as a binary, overlapping node clustering problem on unitig graphs, in which nodes represent contiguous, non-branching DNA sequence fragments and distinct edge types capture sequence overlaps and Hi-C proximity information. To solve this problem, we design a contrastive learning framework with custom objective functions and train a graph-transformer-based model, termed grapHiC, to distinguish nodes with paternal, maternal, or homozygous haplotypes. We show that grapHiC significantly outperforms other node clustering methods on genome-sized datasets and that its predictions can successfully guide de novo genome assembly, producing well-phased assemblies across diverse human genome assembly graphs with the DipGNNome assembler. Our code, trained model, and dataset are available at https://anonymous.4open.science/r/graphic_iclr-688D/ (repository anonymized for peer review).
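To make the problem formulation concrete, the following is a minimal sketch of a unitig graph with two edge types and of how two overlapping binary cluster memberships could map to phase labels. It is not the grapHiC implementation: the names `UnitigGraph`, `EdgeType`, and `decode_membership`, the label strings, and the treatment of the "neither cluster" case are all assumptions made for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from enum import Enum


class EdgeType(Enum):
    """Hypothetical edge types, following the two kinds named in the abstract."""
    OVERLAP = "overlap"  # sequence overlap between two unitigs
    HIC = "hic"          # Hi-C proximity contact between two unitigs


@dataclass
class UnitigGraph:
    """Illustrative container: unitig IDs as nodes, typed undirected edges."""
    nodes: list[str] = field(default_factory=list)
    edges: list[tuple[str, str, EdgeType]] = field(default_factory=list)

    def add_edge(self, u: str, v: str, kind: EdgeType) -> None:
        # Register unseen endpoints, then store the typed edge.
        for n in (u, v):
            if n not in self.nodes:
                self.nodes.append(n)
        self.edges.append((u, v, kind))

    def neighbors(self, node: str, kind: EdgeType) -> list[str]:
        # Treat edges as undirected and filter by edge type.
        found = {
            v if u == node else u
            for u, v, k in self.edges
            if k is kind and node in (u, v)
        }
        return sorted(found)


def decode_membership(in_cluster_a: bool, in_cluster_b: bool) -> str:
    """Map two overlapping binary memberships to a phase label.

    Assumption: a node in both clusters is homozygous. Which cluster
    corresponds to the paternal or maternal haplotype is arbitrary here.
    """
    if in_cluster_a and in_cluster_b:
        return "homozygous"
    if in_cluster_a:
        return "haplotype_a"
    if in_cluster_b:
        return "haplotype_b"
    return "unassigned"  # assumed fallback, not described in the abstract
```

Because the clusters may overlap, each node carries two independent membership bits rather than a single class index, which is what lets homozygous unitigs belong to both haplotypes at once.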
Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)
Submission Number: 4388