Practical High-Order Entropy-Compressed Text Self-Indexing

Hongwei Huo, Peng Long, Jeffrey Scott Vitter

Published: 01 Mar 2023, Last Modified: 15 Jan 2026 | Venue: IEEE Transactions on Knowledge and Data Engineering | License: CC BY-SA 4.0

Abstract: Compressed self-indexes are widely used in string-processing applications such as information retrieval, genome analysis, data mining, and web searching. A self-index not only indexes the data but also encodes it, in compressed form. Moreover, the index and the data it encodes can be operated on directly, without decompressing the entire index, which saves time while keeping storage small. In some applications, such as genome analysis, existing methods do not exploit the full potential of compressed self-indexes, so we seek faster and more space-efficient indexes. In this paper, we propose a practical high-order entropy-compressed self-index for efficient pattern matching in a text. We give practical implementations of compressed suffix arrays that use a hybrid encoding to represent the neighbor function $\Phi$. We analyze, in theory and in practice, the performance of our recommended indexing method, called $\mathsf{GeCSA}$. Retrieval time can be improved further by using an iterated version of the neighbor function. Experimental results on the tested data show that $\mathsf{GeCSA}$ offers good overall advantages in space usage and retrieval time over state-of-the-art indexing methods, especially on repetitive data.
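
For readers unfamiliar with the neighbor function, the following is a minimal Python sketch, assuming the standard compressed-suffix-array definition $\Phi(i) = SA^{-1}[(SA[i] + 1) \bmod n]$. It is an illustration only, not the $\mathsf{GeCSA}$ implementation: the hybrid encoding and the iterated variant described in the abstract are not reproduced, and the function names (`suffix_array`, `neighbor_function`, `phi_runs`, `extract`) and the toy text `banana$` are hypothetical choices made for this example.

```python
def suffix_array(text):
    # Naive construction by sorting suffixes; adequate only for toy inputs.
    return sorted(range(len(text)), key=lambda i: text[i:])


def neighbor_function(sa):
    # Phi[i] is the rank of the suffix that starts one text position
    # after the suffix of rank i (wrapping around at the end of the text).
    n = len(sa)
    inverse = [0] * n
    for rank, pos in enumerate(sa):
        inverse[pos] = rank
    return [inverse[(sa[i] + 1) % n] for i in range(n)]


def phi_runs(text, sa, phi):
    # Group Phi values by the first character of the corresponding suffix.
    # Within each group the values are increasing, which is the property
    # that makes Phi amenable to gap-based compression.
    runs = {}
    for rank, pos in enumerate(sa):
        runs.setdefault(text[pos], []).append(phi[rank])
    return runs


def extract(text, sa, phi, length):
    # Recover the first `length` characters using only Phi and the sorted
    # first characters of the suffixes, without reading `text` by position.
    first_chars = sorted(text)
    rank = sa.index(0)  # rank of the suffix that starts at text position 0
    out = []
    for _ in range(length):
        out.append(first_chars[rank])
        rank = phi[rank]
    return "".join(out)


text = "banana$"  # '$' is a sentinel that sorts before every letter
sa = suffix_array(text)          # [6, 5, 3, 1, 0, 4, 2]
phi = neighbor_function(sa)      # [4, 0, 5, 6, 3, 1, 2]
print(phi_runs(text, sa, phi))   # {'$': [4], 'a': [0, 5, 6], 'b': [3], 'n': [1, 2]}
print(extract(text, sa, phi, len(text)))  # banana$
```

Following $\Phi$ repeatedly walks through the text one position at a time, which is why the text can be recovered from the index itself; the increasing runs are what an encoder for $\Phi$ can exploit.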

External IDs: doi:10.1109/tkde.2021.3114401