Deep learning approaches for predicting pathogenic potentials of novel DNA and RNA sequences

Jakub M. Bartoszewicz

Published: 2022, Last Modified: 09 Feb 2024, Readers: Everyone

Abstract: Regular emergence of novel pathogens is one of the greatest threats to global health. DNA and RNA sequencing enable detection of new viruses and microbes, but standard approaches for computational analysis of sequencing data rely on predefined lists of known agents. New pathogens, with genomes highly divergent from available references, remain difficult to recognize. This problem can be alleviated by training classifiers predicting whether a given sequencing read originates from a possibly novel pathogen. I show that deep neural networks invariant to DNA reverse-complementarity markedly outperform alternatives based on other machine learning algorithms and homology detection by sequence alignment. This holds for both bacteria and viruses. I introduce new methods enabling analysis and visualization of the learned patterns, as well as identification of sequences, genes and genomic regions associated with high pathogenic potential. Modified ResNet architectures combined with real-time mapping of short reads can accurately recognize both known and novel threats as the sequencer is running. Analogous models also work for short fragments of long reads, corresponding to just 0.5 s of sequencing time. I then describe a manually curated database of fungal pathogen genomes facilitating detection of novel threats with both machine learning and alternative approaches. I use learned numerical representations of the genomes in the database to visualize the relationship between taxonomy and the pathogenic phenotype. Finally, I employ the developed neural architectures to classify reads sampled from mixtures of different novel bacteria, viruses, and fungi. The methods presented here are implemented in the DeePaC and DeePaC-Live packages. They can be easily reused for training, evaluation, and deployment of deep neural networks for DNA and RNA sequences. 
Although the main focus is on identifying emerging pathogens from sequencing data, the presented approaches could also be used to screen synthetic sequences and detect engineered threats. The trained networks can predict abstract, complex traits directly from sequences, without relying on close taxonomic matches. In the future, similar 'phenotype models' could find many alternative applications in rapid diagnostics, public health and synthetic biology.
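
To illustrate what invariance to DNA reverse-complementarity means in practice, here is a minimal NumPy sketch. It is not the DeePaC implementation and does not reproduce the architectures described in the abstract; the function names, the toy single-filter scorer, and the strategy of averaging predictions over both strands are all assumptions chosen for brevity. The point is only that a read and its reverse complement should receive the same score, since a sequencer may report either strand.

```python
import numpy as np

# With the channel order A, C, G, T, complementing a base is the same as
# reversing the channel axis (A<->T, C<->G).
BASES = "ACGT"


def one_hot(seq):
    """Encode a DNA string as a (length, 4) array; unknown bases (e.g. N) stay all-zero."""
    x = np.zeros((len(seq), 4), dtype=np.float32)
    for i, base in enumerate(seq.upper()):
        j = BASES.find(base)
        if j >= 0:
            x[i, j] = 1.0
    return x


def reverse_complement(x):
    """Reverse-complement a one-hot read: flip the position axis and the channel axis."""
    return x[::-1, ::-1]


def toy_score(x, weights):
    """Hypothetical strand-sensitive scorer: one convolutional filter, max-pooled, then sigmoid."""
    k = weights.shape[0]
    activations = [np.sum(x[i:i + k] * weights) for i in range(x.shape[0] - k + 1)]
    return 1.0 / (1.0 + np.exp(-max(activations)))


def rc_invariant_score(x, weights):
    """Average the scorer over both strands, which makes the output strand-independent."""
    return 0.5 * (toy_score(x, weights) + toy_score(reverse_complement(x), weights))


# Illustrative inputs (assumed, not taken from the thesis).
rng = np.random.default_rng(0)
weights = rng.normal(size=(3, 4))   # one random filter of width 3
read = "ACGTTGCAAGGCT"
read_rc = "AGCCTTGCAACGT"           # reverse complement of `read`

score_fwd = rc_invariant_score(one_hot(read), weights)
score_rev = rc_invariant_score(one_hot(read_rc), weights)
```

Averaging over both strands is the simplest way to obtain this property for an arbitrary scorer. Sharing weights between forward and reverse-complement filters inside the network is another common option, and it achieves the same invariance without a second forward pass.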
