Sketching and Sequence Alignment: A Rate-Distortion Perspective

Ilan Shomorony, Govinda M. Kamath

2021 (modified: 24 Apr 2023) | ISIT 2021 | Readers: Everyone

Abstract: Pairwise alignment of DNA sequencing data is a ubiquitous task in bioinformatics and typically represents a heavy computational burden. A standard approach to speed up this task is to compute “sketches” of the DNA reads (typically via hashing-based techniques) that allow the efficient computation of pairwise alignment scores. We propose a rate-distortion framework to study the problem of computing sketches that achieve the optimal tradeoff between sketch size and alignment estimation distortion. We consider the simple setting of i.i.d. error-free sources of length $n$ and introduce a new sketching algorithm called “locational hashing.” While standard approaches in the literature based on min-hashes require $B=(1/D)\cdot O(\log n)$ bits to achieve a distortion $D$, our proposed approach only requires $B=\log^{2}(1/D)\cdot O(1)$ bits. This can lead to significant computational savings in pairwise alignment estimation.
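For readers unfamiliar with the min-hash baseline the abstract refers to, the following is a minimal illustrative sketch of a generic k-mer min-hash, not the paper's locational hashing and not the authors' code. The function names, the choice of BLAKE2b as the hash, and the use of the fraction of matching sketch entries as a similarity estimate are all assumptions made for illustration.

```python
import hashlib


def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    if len(seq) < k:
        raise ValueError("sequence must be at least k symbols long")
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def hash_kmer(kmer, seed):
    """Map a k-mer to a 64-bit integer; the seed selects the hash function.
    BLAKE2b is an illustrative choice, not something the abstract specifies."""
    data = f"{seed}:{kmer}".encode()
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_sketch(seq, k, num_hashes):
    """Sketch a read as the minimum hash value of its k-mers
    under each of num_hashes seeded hash functions."""
    ks = kmers(seq, k)
    return [min(hash_kmer(x, seed) for x in ks) for seed in range(num_hashes)]


def estimate_similarity(sketch_a, sketch_b):
    """Fraction of sketch positions that agree; this estimates the Jaccard
    similarity of the two k-mer sets, a common proxy for overlap."""
    if len(sketch_a) != len(sketch_b):
        raise ValueError("sketches must have the same length")
    return sum(a == b for a, b in zip(sketch_a, sketch_b)) / len(sketch_a)


read_a = "ACGTACGTTAGCCGATAGGCTA"
read_b = "TACGTTAGCCGATAGGCTAACG"
sk_a = minhash_sketch(read_a, k=5, num_hashes=64)
sk_b = minhash_sketch(read_b, k=5, num_hashes=64)
print(estimate_similarity(sk_a, sk_b))
```

In a sketch of this kind, each entry identifies one k-mer, so the sketch size grows with the number of hash functions used, which is what drives the estimate's accuracy. This is the size-versus-distortion tradeoff that the abstract's bounds quantify.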
