A Distributed Alignment-free Pipeline for Human SNPs Genotyping

Lorenzo Di Rocco, Umberto Ferraro Petrillo

Published: 2023, Last Modified: 26 Jul 2024. Venue: BCB 2023. License: CC BY-SA 4.0

Abstract: Identifying known genetic traits and disease-related variants within an individual requires a fundamental task: genotyping a set of variants from a database. However, the efficiency of this process is challenged by the growing volume of sequencing data and variant databases. At such a scale, even the fastest genotyping tool available may take an unacceptably long time to deliver a result. To address this issue, we present SparkGeno, the first known distributed alignment-free pipeline for genotyping the particular case of Single Nucleotide Polymorphisms (SNPs). Building upon a distributed reformulation of traditional alignment-free genotyping pipelines, and using the Apache Spark framework, we introduce several optimizations to further enhance the performance of our code in a distributed environment. Our pipeline comes in two versions that employ different data structures, making them suitable for processing datasets featuring different numbers of SNPs. Moreover, we present the results of an experimental analysis on widely studied datasets to assess how relying on distributed computing allows for a fast, accurate and scalable solution for large-scale genotyping. Finally, we report the results of an additional experiment validating the effectiveness of the signature-based approach we used to perform genotyping. Our results show that SparkGeno, when run on a distributed system, is able to genotype variants from whole-genome sequencing data orders of magnitude faster than existing tools, and that it scales with the number of available computational units. This makes SparkGeno a promising solution for large-scale genotyping applications, such as precision medicine and population-scale studies.
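To make the idea of signature-based, alignment-free genotyping more concrete, the following is a minimal sketch in plain Python. It is not SparkGeno's implementation, which the abstract does not detail. It assumes that each SNP is represented by two k-mer "signatures" (one carrying the reference allele, one the alternate allele), that reads are split into partitions counted independently and then merged (mimicking a map/reduce step on a cluster), and that a genotype is called from the presence of each signature. All names (`snp_signatures`, `count_partition`, `call_genotypes`, `genotype`), the toy SNP identifiers, the k-mer length and the `min_count` threshold are hypothetical, and reverse complements and sequencing errors are ignored for brevity.

```python
from collections import Counter
from functools import reduce

K = 5  # toy k-mer length (assumption); real pipelines use much longer k-mers


def snp_signatures(snps, k=K):
    """Build a lookup from signature k-mer to (snp_id, allele label).

    Each SNP is given as (left flank, ref base, alt base, right flank).
    """
    index = {}
    for snp_id, (left, ref, alt, right) in snps.items():
        for label, base in (("ref", ref), ("alt", alt)):
            kmer = left + base + right
            if len(kmer) != k:
                raise ValueError(f"signature for {snp_id} is not a {k}-mer")
            index[kmer] = (snp_id, label)
    return index


def count_partition(reads, index, k=K):
    """'Map' step: count signature hits within one partition of reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            hit = index.get(read[i:i + k])
            if hit is not None:
                counts[hit] += 1
    return counts


def call_genotypes(counts, snps, min_count=1):
    """Call a genotype per SNP from the merged signature counts."""
    calls = {}
    for snp_id in snps:
        has_ref = counts[(snp_id, "ref")] >= min_count
        has_alt = counts[(snp_id, "alt")] >= min_count
        if has_ref and has_alt:
            calls[snp_id] = "0/1"   # heterozygous
        elif has_alt:
            calls[snp_id] = "1/1"   # homozygous alternate
        elif has_ref:
            calls[snp_id] = "0/0"   # homozygous reference
        else:
            calls[snp_id] = "./."   # no evidence for either allele
    return calls


def genotype(partitions, snps, k=K):
    """Count each partition independently, merge ('reduce'), then call."""
    index = snp_signatures(snps, k)
    partial = [count_partition(p, index, k) for p in partitions]
    total = reduce(lambda a, b: a + b, partial, Counter())
    return call_genotypes(total, snps)


# Toy data: two hypothetical SNPs and two partitions of reads.
snps = {
    "rs_toy1": ("AC", "G", "T", "TA"),  # signatures ACGTA (ref), ACTTA (alt)
    "rs_toy2": ("GG", "A", "C", "TT"),  # signatures GGATT (ref), GGCTT (alt)
}
partitions = [
    ["TTACGTACC", "CCACTTAGG"],
    ["AAGGCTTAA", "TGGCTTCA"],
]
print(genotype(partitions, snps))
```

Because per-partition counts are merged by simple addition, the counting step can run independently on each partition, which is the property a distributed framework such as Apache Spark can exploit. In this toy run, `rs_toy1` is supported by both signatures and `rs_toy2` only by the alternate one.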