Coded Shotgun Sequencing

Aditya Narayan Ravi, Alireza Vahid, Ilan Shomorony

2022 (modified: 24 Apr 2023)IEEE J. Sel. Areas Inf. Theory 2022Readers: Everyone

Abstract: Most DNA sequencing technologies are based on the shotgun paradigm: many short reads are obtained from random unknown locations in the DNA sequence. A fundamental question, in Motahari <italic xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">et al.</i> , (2013), is what read length and coverage depth (i.e., the total number of reads) are needed to guarantee reliable sequence reconstruction. Motivated by DNA-based storage, we study the coded version of this problem; i.e., the scenario where the DNA molecule being sequenced is a codeword from a predefined codebook. Our main result is an exact characterization of the capacity of the resulting <italic xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">shotgun sequencing channel</i> as a function of the read length and coverage depth. In particular, our results imply that, while in the uncoded case, <inline-formula xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink"> <tex-math notation="LaTeX">$O(n)$ </tex-math></inline-formula> reads of length greater than <inline-formula xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink"> <tex-math notation="LaTeX">$2 \log n$ </tex-math></inline-formula> are needed for reliable reconstruction of a length- <inline-formula xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink"> <tex-math notation="LaTeX">$n$ </tex-math></inline-formula> binary sequence, in the coded case, only <inline-formula xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink"> <tex-math notation="LaTeX">$O(n/\log n)$ </tex-math></inline-formula> reads of length greater than <inline-formula xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink"> <tex-math notation="LaTeX">$\log n$ </tex-math></inline-formula> are needed for the capacity to be arbitrarily close to 1.

0 Replies