LRCB: A Comprehensive Benchmark Evaluation of Reference-free Lossless Compression Tools for Genomics Sequencing Long Reads Data
Abstract: Advances in long reads sequencing technologies have led to a significant increase in biological sequencing data. Although several reference-free compressors, both dedicated and general-purpose, are available for reducing the storage space of long reads data, choosing a suitable one is challenging because thorough and systematic evaluations of their lossless compression effectiveness are lacking. In this study, we benchmarked 30 compressors, 11 specialized for long reads and 19 general-purpose, on 31 real-world datasets that differ in sequencing platform, species, and length. Each lossless compressor was evaluated on 13 performance measures, including compression strength, compression robustness, and the time and peak memory required for compression and decompression. For future long reads compressors, we also outline research directions covering data security for privacy-sensitive sequences, hardware parallel acceleration, parameter tuning frameworks, and integrated hardware-algorithm system design. We summarize the results as the Long Reads Compression Benchmark (LRCB), available at https://github.com/fahaihi/LRCB.