Abstract: Sequence sketching, a class of techniques for generating compact representations of longer sequences, has become widely used in numerous long-read applications, including assembly and mapping. Instead of comparing full sequences, sketches allow us to sample from a subspace of k-mers and use those samples for comparison, saving both time and memory in the end application. One of the important metrics that determines the performance of a sketch is the sketch density, which refers to the fraction of the sampled k-mers retained by the sketch. While a lower density is preferable for space considerations, it can also affect the sensitivity of the mapping process. In this work, we address the problem of reducing sketch density while preserving accuracy in the context of long-read mapping. We present an efficient algorithm called MHsketch that uses Jaccard estimators to reduce sketch density in mapping applications. Starting from an initial ground set of k-mers generated through a sketching method of choice, the approach applies MinHashing to derive a smaller sketch and uses that sketch for mapping. In addition to reducing density, this approach is also easily parallelizable. To demonstrate the efficacy of our method, we modified a recently developed long-read mapping tool (JEM-mapper) to adopt different sketching schemes, including syncmers and strobemers, and incorporated MHsketch to evaluate the effectiveness of downsampling. Experimental evaluation demonstrates the ability of our approach to significantly reduce density and reap performance benefits from it. In particular, our experiments reveal that MHsketch (syncmers) achieves high-quality mapping while reducing time-to-solution (speedups between \(2.2\times \) and \(9.1\times \)) and drastically reducing memory usage (\(>90\%\) savings) compared to state-of-the-art tools. Availability: https://github.com/TazinRahman1105050/MHsketch.
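The downsampling idea described in the abstract (start from a ground set of k-mers, then apply MinHashing to obtain a smaller sketch that supports Jaccard estimation) can be illustrated with a minimal, generic MinHash sketch. This is not the MHsketch implementation: the function names (`kmers`, `hash_kmer`, `minhash_sketch`, `jaccard_estimate`), the use of seeded BLAKE2b as the hash family, and the choice of all k-mers as the ground set (standing in for syncmers or strobemers) are assumptions made purely for illustration.

```python
import hashlib


def kmers(seq, k):
    """Return the set of all k-mers in seq (hypothetical stand-in for a
    ground set produced by syncmers, strobemers, or another scheme)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def hash_kmer(kmer, seed):
    """Hash a k-mer to a 64-bit integer; the seed selects one hash function
    from a family (assumed here: BLAKE2b with the seed used as salt)."""
    digest = hashlib.blake2b(
        kmer.encode("ascii"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_sketch(ground_set, num_hashes):
    """Downsample a ground set to num_hashes values: for each seeded hash
    function, keep only the minimum hash value over the ground set."""
    if not ground_set:
        raise ValueError("ground set must not be empty")
    return [
        min(hash_kmer(x, seed) for x in ground_set)
        for seed in range(num_hashes)
    ]


def jaccard_estimate(sketch_a, sketch_b):
    """Estimate Jaccard similarity as the fraction of hash functions whose
    minimum values agree between the two sketches."""
    if len(sketch_a) != len(sketch_b):
        raise ValueError("sketches must have the same length")
    matches = sum(a == b for a, b in zip(sketch_a, sketch_b))
    return matches / len(sketch_a)


if __name__ == "__main__":
    read = "ACGTACGTTGCAACGTAGCTAGCTAGGATCCA"
    ground = kmers(read, 5)
    sketch = minhash_sketch(ground, 16)
    print(len(ground), "ground-set k-mers reduced to", len(sketch), "values")
```

Each of the `num_hashes` minima is computed independently of the others, so the loop over seeds could be distributed across workers, which is one way to read the abstract's remark about parallelizability. The sketch size here is fixed by `num_hashes` rather than by the length of the input, which is what lowers density relative to the ground set.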
External IDs: dblp:journals/bmcbi/RahmanK26