GenStore: In-Storage Filtering of Genomic Data for High-Performance and Energy-Efficient Genome Analysis

Nika Mansouri-Ghiasi, Jisung Park, Harun Mustafa, Jeremie S. Kim, Ataberk Olgun, Arvid Gollwitzer, Damla Senol Cali, Can Firtina, Haiyu Mao, Nour Almadhoun Alserr, Rachata Ausavarungnirun, Nandita Vijaykumar, Mohammed Alser, Onur Mutlu

Published: 2022, Last Modified: 16 May 2025ISVLSI 2022EveryoneRevisionsBibTeXCC BY-SA 4.0

Abstract: Genome sequence analysis, which analyzes the DNA sequences of organisms, is important for many applications in personalized medicine [1]–[8], outbreak tracing [9]–[14], and evolutionary studies [15]–[21]. The information of an organism's DNA is converted to digital data via a process called sequencing. A sequencing machine extracts the sequences of DNA molecules from the organism's sample in the form of strings consisting of four base pairs $(bps)$ , denoted by $\mathrm{A}, \mathrm{C}, \mathrm{G}$ , and T. No current sequencing technology has the capability to read a human DNA molecule in its entirety. Instead, state-of-the-art sequencing machines generate randomly sampled, inexact sub-strings of the original genome, called reads. The information about the corresponding location of each read in the complete genome is lost during sequencing in most technologies. State-of-the-art sequencing machines produce one of two kinds of reads. 1) Short read sequencing technologies, such as Illumina [22], [23], produce reads that are highly accurate (99-99.9%) [24]–[26], but short (e.g., up to a few hundred DNA base pairs [24], [27], [28]). 2) Long read sequencing technologies, such as Pacific Biosciences (PacBio) [29] and Oxford Nanopore Technologies (ONT) [30], produce reads that are less accurate (85-90%) [27,31–33], but long (e.g., lengths ranging from thousands to millions of base pairs [34]).