Normalized Compression Distance for DNA Classification

Gavin L. A. Hearne, Mohammad S. Refahi, Haozhe Neil Duan, James R. Brown, Gail L. Rosen

Published: 2024, Last Modified: 19 Feb 2025, Venue: BCB 2024, License: CC BY-SA 4.0

Abstract: The increasingly common use of next-generation sequencing has enabled greater access to large-scale (meta-)genomic datasets than ever before. The resulting deluge of data has made efficient DNA sequence classification an urgent challenge for downstream analyses. Traditional alignment-based methods for DNA sequence classification struggle with increasingly large volumes of sequence data because of the computational complexity of alignment. Consequently, there is a need for methods capable of identifying sequences without alignment. Normalized compression distance (NCD) has proven effective in text classification as a low-resource alternative to deep neural networks, leveraging compression algorithms to approximate the Kolmogorov information distance. To apply this technique to genomics tasks akin to those addressed by tools such as Many-against-Many sequence searching (MMseqs) and Kraken2, we explore the use of a gzip-based NCD for both gene labeling of open reading frames (ORFs) and taxonomic classification of short reads. Our results demonstrate the efficacy of NCD in diverse multitask classification, and we further explore the capacity of NCD to classify larger libraries of metagenomic reads.
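
To make the idea concrete, the following is a minimal Python sketch of a gzip-based NCD, using the standard definition NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C is the compressed length in bytes. It is not the authors' implementation: the abstract does not specify the classifier, so the nearest-neighbor rule in `classify` is an assumption, and the names `compressed_size`, `ncd`, `classify`, and `random_dna` are hypothetical, chosen for illustration only.

```python
import gzip
import random


def compressed_size(seq: str) -> int:
    """Length in bytes of the gzip-compressed sequence."""
    return len(gzip.compress(seq.encode("ascii")))


def ncd(x: str, y: str) -> float:
    """Normalized compression distance between two sequences."""
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)  # compress the concatenation
    return (cxy - min(cx, cy)) / max(cx, cy)


def classify(query: str, references: list[tuple[str, str]]) -> str:
    """Label of the (sequence, label) reference nearest to the query.

    Assumption: a 1-nearest-neighbor rule; the abstract does not say
    which classifier is paired with the distance.
    """
    return min(references, key=lambda ref: ncd(query, ref[0]))[1]


def random_dna(length: int, seed: int) -> str:
    """Hypothetical helper: reproducible random DNA for toy examples only."""
    rng = random.Random(seed)
    return "".join(rng.choice("ACGT") for _ in range(length))
```

As a toy usage example, `classify(query, [(random_dna(600, 1), "gene_A"), (random_dna(600, 2), "gene_B")])` returns the label of whichever reference compresses best together with `query`. Real ORFs or short reads would replace the synthetic sequences, and gzip's 32 KB window limits how much shared structure it can detect between very long inputs.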