Continuous Embeddings of DNA Sequencing Reads and Application to Metagenomics

Romain Menegaux, Jean-Philippe Vert

2019 (modified: 12 May 2023) | J. Comput. Biol. 2019

Abstract: We propose a new model for fast classification of DNA sequences output by next-generation sequencing machines. The model, which we call fastDNA, embeds DNA sequences in a vector space by learning continuous low-dimensional representations of the k-mers they contain. We show on metagenomics benchmarks that it outperforms state-of-the-art methods in terms of accuracy and scalability.
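
The idea of embedding a read through its k-mers can be illustrated with a minimal sketch. It is not the fastDNA implementation: the abstract does not specify how k-mer vectors are combined or how a read is classified, so the sketch assumes that a read's embedding is the mean of its k-mer vectors and that a linear layer scores the classes. The names `kmer_indices`, `embed_read`, `classify`, `kmer_table`, and `weights` are hypothetical, and the parameters are random rather than learned.

```python
import numpy as np

BASES = "ACGT"


def kmer_indices(read, k):
    """Map each k-mer of `read` to an integer in [0, 4**k) using base-4 encoding.

    k-mers containing a character outside ACGT (for example N) are skipped.
    """
    code = {base: i for i, base in enumerate(BASES)}
    indices = []
    for start in range(len(read) - k + 1):
        idx = 0
        for ch in read[start:start + k]:
            if ch not in code:
                idx = None
                break
            idx = idx * 4 + code[ch]
        if idx is not None:
            indices.append(idx)
    return indices


def embed_read(read, kmer_table, k):
    """Assumption: the read embedding is the mean of its k-mer embeddings."""
    idx = kmer_indices(read, k)
    if not idx:
        # No valid k-mer: fall back to the zero vector.
        return np.zeros(kmer_table.shape[1])
    return kmer_table[idx].mean(axis=0)


def classify(read, kmer_table, weights, k):
    """Assumption: a linear layer scores each class; return the best class index."""
    scores = weights @ embed_read(read, kmer_table, k)
    return int(np.argmax(scores))


# Toy parameters. In a trained model these would be learned from labeled reads.
rng = np.random.default_rng(0)
k, dim, n_classes = 3, 8, 4
kmer_table = rng.normal(size=(4 ** k, dim))   # one row per possible k-mer
weights = rng.normal(size=(n_classes, dim))   # one row per class (e.g. taxon)

read = "ACGTTGCA"
vec = embed_read(read, kmer_table, k)
label = classify(read, kmer_table, weights, k)
```

Under these assumptions the embedding table has 4^k rows, so its memory cost grows exponentially with k, while embedding a read only requires looking up and averaging one row per k-mer.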
