QMSpy: An Integrated Modular and Scalable Platform for Quantitative Metagenomics in Pyspark

Ari Ugarte, Minh Quang-Dao, Bich Hai Ho, Chloe Vigliotti, Eugeni Belda, Jean-Daniel Zucker, Edi Prifti

Published: 2019, Last Modified: 13 Nov 2024. Venue: RIVF 2019. License: CC BY-SA 4.0

Abstract: With the advent of next-generation sequencing and the ever-growing interest in quantitative metagenomics, we face unprecedented computational bottlenecks in processing massive metagenomics datasets. Traditional metagenomics pipelines are composed of multiple stages, often coded separately in different languages and following different standards. This lack of integrated modularity makes their maintenance, evolution and scalability a challenge. Consequently, we designed QMSpy, a novel all-in-one platform that is both modular and scalable. Raw reads are transformed into distributed objects, which are processed, mapped against catalogs and smart-counted. Well-known tools such as Trimmomatic and Bowtie2 are integrated into the platform through ad hoc APIs that process the modular objects. QMSpy supports the creation of customizable workflows. It is implemented with the PySpark framework, which extends the MapReduce model, and it is deployable on most popular high-performance computing clusters, such as Google Cloud and Amazon EC2. This scalability allows us to propose several novel algorithms that improve gene, taxonomic and functional quantification of microbial ecosystems. Finally, we benchmarked QMSpy with real and simulated data to establish the best hyperparameters. We believe that this resource will be of great value to the scientific community. Its modularity, its scalability and the fact that it is open source will allow users to contribute easily and grow it further toward broader adoption. Contact: e.prifti[at]ican-institute.org.
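The abstract's filter, map and count stages can be illustrated with a minimal sketch of the MapReduce pattern in plain Python. This is not QMSpy's code or API: the catalog, the read format and every function name below are hypothetical, exact substring matching stands in for a real aligner such as Bowtie2, and the plain per-gene tally stands in for the smart-counting step, which the abstract does not specify.

```python
from collections import Counter
from functools import reduce

# Hypothetical toy gene catalog (gene id -> reference sequence); an assumption,
# not a real catalog or anything taken from QMSpy.
CATALOG = {
    "geneA": "ACGTACGTGGCC",
    "geneB": "TTGACCGTAAGC",
}


def passes_quality(read, min_len=4):
    """Filter stage (stand-in for trimming): keep reads that are long enough
    and contain no ambiguous base."""
    _, seq = read
    return len(seq) >= min_len and "N" not in seq


def map_read(read):
    """Map stage: emit (gene_id, 1) for every catalog gene that contains the
    read as an exact substring (stand-in for alignment)."""
    _, seq = read
    return [(gene, 1) for gene, ref in CATALOG.items() if seq in ref]


def merge_counts(acc, pair):
    """Reduce stage: add one emitted pair to the running per-gene totals."""
    gene, n = pair
    acc[gene] += n
    return acc


def count_genes(reads):
    """Chain the three stages: filter, flat-map, reduce by key."""
    kept = filter(passes_quality, reads)
    pairs = (pair for read in kept for pair in map_read(read))
    return dict(reduce(merge_counts, pairs, Counter()))


reads = [
    ("r1", "ACGTAC"),  # matches geneA
    ("r2", "CCGTAA"),  # matches geneB
    ("r3", "GTGGCC"),  # matches geneA
    ("r4", "ACN"),     # dropped by the filter stage
    ("r5", "GGGGGG"),  # passes the filter but matches no gene
]

gene_counts = count_genes(reads)
print(gene_counts)
```

In PySpark the same three stages would typically be expressed on an RDD with `filter`, `flatMap` and `reduceByKey`, which lets the per-read work be distributed across a cluster.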