Adaptive Lossless Compression for Genomics Data by Multiple (s, k)-mer Encoding and XLSTM

Published: 2025, Last Modified: 14 Jan 2026, ICASSP 2025, CC BY-SA 4.0
Abstract: Learning-based lossless compressors have shown competitive advantages in genomics data (GD) compression. However, learning-based GD-dedicated compressors are typically pre-trained on multi-source data and then applied directly to different target data. We call these static compressors, and they often face two challenges: limited compression ratios and poor generalization caused by variations in data distribution. To address these problems, we propose AGDLC, a novel Adaptive Genomics Data Lossless Compressor. It includes two critical designs: 1) a multiple (s, k)-mer mixer that extracts GD redundancy from multiple dimensions to improve compression ratios; 2) the recently popular XLSTM model as the backbone, which compresses GD adaptively by updating its parameters during compression, without pre-training, improving both compression ratio and generalization. We compare AGDLC with 13 baselines on 7 real-world datasets, and the experimental results demonstrate that it achieves the best compression ratio, with average improvements ranging from 2.162% to 69.436%. The code is available at https://github.com/dingyanfeng/AGDLC.
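The two ideas in the abstract can be illustrated with a minimal Python sketch. It is not the AGDLC implementation. The abstract does not define the (s, k)-mer notation, so the sketch assumes s is the stride between windows and k is the window length. It also replaces the XLSTM backbone with a simple adaptive count model, only to show what "compressing while updating parameters, without pre-training" means. The function names `sk_mers` and `adaptive_code_length` are hypothetical.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"  # assumption: input contains only these four bases


def sk_mers(seq, s, k):
    """Split seq into windows of length k taken every s positions.

    Assumed reading of "(s, k)-mer": s = stride, k = window length.
    """
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, s)]


def adaptive_code_length(seq, order=2):
    """Ideal code length in bits under an adaptive context model.

    Every context starts with uniform counts, and the counts are updated
    after each symbol is coded. No pre-training is involved, and a decoder
    could repeat the same updates to stay in sync. This count model is a
    stand-in for the neural backbone, not the method from the paper.
    """
    counts = defaultdict(lambda: {b: 1 for b in ALPHABET})
    bits = 0.0
    for i, sym in enumerate(seq):
        ctx = seq[max(0, i - order):i]      # up to `order` preceding bases
        table = counts[ctx]
        total = sum(table.values())
        bits -= math.log2(table[sym] / total)  # cost of coding sym
        table[sym] += 1                        # adapt after coding
    return bits


if __name__ == "__main__":
    read = "ACGTAC"
    # Several (s, k) settings give different views of the same sequence.
    for s, k in [(1, 3), (2, 3), (3, 3)]:
        print((s, k), sk_mers(read, s, k))
    repetitive = "ACGT" * 50
    print(adaptive_code_length(repetitive), "bits vs", 2 * len(repetitive), "bits fixed-width")
```

Under these assumptions, running several (s, k) settings over one sequence yields several token streams, which is one way to read "extracting redundancy from multiple dimensions". The code-length function shows why an adaptive model can beat a fixed 2-bit-per-base encoding on repetitive input even though it starts with no prior knowledge.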