Keywords: genome assembly, graph neural networks, assembly graph, path finding
Abstract: The quest to determine the human DNA sequence from telomere to telomere started three decades ago and was finally completed in 2021. This accomplishment was the result of a tremendous effort by numerous experts who used an abundance of data and various tools, and often relied on manual inspection during genome reconstruction. Such a method can therefore hardly serve as a general approach to assembling genomes, especially when assembly speed is important. Motivated by this achievement and aspiring to make it more accessible, we investigate a previously untaken path: applying geometric deep learning to the central part of genome assembly, namely untangling a large assembly graph from which a genomic sequence needs to be reconstructed. A graph convolutional network is trained on a dataset generated from human genomic data to reconstruct the genome by finding a path through the assembly graph. We show that our model can compute scores from the lengths of the overlaps between the sequences and the graph topology, and that a greedy search over these scores outperforms a greedy search over the overlap lengths alone. Moreover, our method reconstructs the correct path through the graph in a fraction of the time required by state-of-the-art de novo assemblers. This favourable result paves the way for the development of powerful graph machine learning algorithms that can solve the de novo genome assembly problem much more quickly, and possibly more accurately, than human handcrafted techniques.
One-sentence Summary: We train a graph convolutional network to find a path through an assembly graph, which could reduce the fragmentation and execution time of existing genome assemblers.
Supplementary Material: zip
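Illustrative sketch: the abstract contrasts a greedy search over overlap lengths with a greedy search over model-computed scores. The toy example below shows how the two traversals can diverge on the same graph. It is a minimal sketch and not the authors' implementation: the graph, the read names, the overlap lengths, and especially the per-edge scores are all hypothetical values made up for illustration, and no neural network is involved.

```python
from typing import Callable, Dict, List, Tuple

# Edge = (successor, overlap_length, score). All values are hypothetical.
Edge = Tuple[str, int, float]
Graph = Dict[str, List[Edge]]


def greedy_path(graph: Graph, start: str, key: Callable[[Edge], float]) -> List[str]:
    """Walk from `start`, always taking the unvisited successor that maximises `key`."""
    path = [start]
    visited = {start}
    node = start
    while True:
        candidates = [e for e in graph.get(node, []) if e[0] not in visited]
        if not candidates:  # dead end: stop the walk
            return path
        best = max(candidates, key=key)
        node = best[0]
        visited.add(node)
        path.append(node)


# Toy assembly graph: the longest overlap out of r1 leads to a dead end (r2),
# while the made-up scores favour the branch that continues through r3 and r4.
toy_graph: Graph = {
    "r1": [("r2", 900, 0.2), ("r3", 800, 0.9)],
    "r2": [],
    "r3": [("r4", 700, 0.8)],
    "r4": [],
}

path_by_overlap = greedy_path(toy_graph, "r1", key=lambda e: e[1])
path_by_score = greedy_path(toy_graph, "r1", key=lambda e: e[2])
print(path_by_overlap)
print(path_by_score)
```

In this toy case the overlap-only walk stops early at a dead end, which corresponds to a fragmented reconstruction, whereas the score-guided walk continues along the longer path. In the work summarised above, the scores would come from the trained graph convolutional network rather than being fixed by hand.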