Bi-Directional Context Modeling with Combinatorial Structuring for Genome Sequence Compression

Wenrui Dai, Hongkai Xiong

2015 (modified: 16 May 2022), DCC 2015

Abstract: Summary form only given. This paper proposes a bi-directional context modeling (BCM) technique for reference-free genome sequence compression. BCM constructs its contexts by combining arbitrary predicted symbols in two directions, corresponding to approximate repeats and non-repeat regions. It can therefore predict DNA sequences sequentially with weighted conditional probabilities that simultaneously exploit the correlations among matched approximate repeats and fit the variable-order statistics of non-repeat regions. Moreover, BCM eliminates the overhead of pointer information for specifying approximate repeats, because the modeling is synchronized at both the encoder and the decoder. In theory, we show that the upper bounds on the excess model redundancy incurred by BCM vanish as the sequence size grows. Experimental results show that BCM outperforms state-of-the-art reference-free compressors such as FCM and CTW+LZ.
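To make the idea of sequential prediction with weighted conditional probabilities concrete, the following is a minimal sketch of a generic mixture of finite-context models over the DNA alphabet. It is not the BCM algorithm from the paper: it has no bi-directional contexts or approximate-repeat matching. The names `ContextModel` and `estimate_code_length`, the choice of orders, the add-alpha smoothing, and the Bayesian weight update are all assumptions made for illustration. Because every prediction depends only on symbols already seen, a decoder could reproduce the same probabilities without side information, which is the kind of encoder-decoder synchronization the abstract refers to.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"


class ContextModel:
    """Order-k finite-context model with add-alpha smoothing (hypothetical helper)."""

    def __init__(self, order, alpha=1.0):
        self.order = order
        self.alpha = alpha
        self.counts = defaultdict(lambda: [0] * len(ALPHABET))

    def _context(self, history):
        # history[-0:] would return the whole string, so order 0 is handled explicitly.
        return history[-self.order:] if self.order else ""

    def predict(self, history):
        """Return P(next symbol | context) for each symbol in ALPHABET."""
        c = self.counts[self._context(history)]
        total = sum(c) + self.alpha * len(ALPHABET)
        return [(n + self.alpha) / total for n in c]

    def update(self, history, symbol):
        self.counts[self._context(history)][ALPHABET.index(symbol)] += 1


def estimate_code_length(seq, orders=(2, 8)):
    """Ideal code length in bits of seq under a weighted mixture of context models."""
    models = [ContextModel(k) for k in orders]
    weights = [1.0 / len(models)] * len(models)
    bits = 0.0
    for i, sym in enumerate(seq):
        history = seq[:i]  # only past symbols, so a decoder can do the same
        s = ALPHABET.index(sym)
        preds = [m.predict(history) for m in models]
        mixed = sum(w * p[s] for w, p in zip(weights, preds))
        bits += -math.log2(mixed)
        # Assumed Bayesian update: models that predicted the symbol well gain weight.
        weights = [w * p[s] for w, p in zip(weights, preds)]
        norm = sum(weights)
        weights = [w / norm for w in weights]
        for m in models:
            m.update(history, sym)
    return bits


if __name__ == "__main__":
    seq = "ACGT" * 50
    print(f"{estimate_code_length(seq) / len(seq):.3f} bits per base")
```

An arithmetic coder driven by the mixed probabilities would approach this ideal code length. The 2 bits per base of a fixed-length encoding of {A, C, G, T} serves as the natural baseline for comparison.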
