Abstract: Studying the genomic diversity of viruses can help us understand how viruses evolve and how that evolution can impact human health. Rather than use a laborious and tedious wet-lab approach to conduct a genomic diversity study, we take a computational approach, using the ubiquitous NCBI BLAST and our parallel and distributed SparkLeBLAST, across 53 patients (~40,000,000 query sequences) on Fugaku, the world's fastest homogeneous supercomputer with 158,976 nodes, where each code contains a 48-core A64FX processor and 32 GB RAM. To project how long BLAST and SparkLeBLAST would take to complete a genomic diversity study of COVID-19, we first perform a feasibility study on a subset of 50 query sequences from a single COVID-19 patient to identify bottlenecks in sequence alignment processing. We then create a model using Amdahl's law to project the run times of NCBI BLAST and SparkLeBLAST on supercomputing systems like Fugaku. Based on the data from this 50-sequence feasibility study, our model predicts that NCBI BLAST, when running on all the cores of the Fugaku supercomputer, would take approximately 26.7 years to complete the full-scale study. In contrast, SparkLeBLAST, using both our query and database segmentation, would reduce the execution time to 0.026 years (i.e., 22.9 hours) – resulting in more than a 10,000× speedup over using the ubiquitous NCBI BLAST.
Loading