Abstract: Inverted indexes are widely used in large-scale search engines to store the lists of document identifiers (docIDs) relevant to query terms, and these lists may be queried thousands of times per second. Traditionally, optimized integer sequence encoding methods are applied to compress the inverted index while maintaining reasonable query processing speed. Recently, a context-free grammar-based method was introduced for inverted index compression; it is particularly useful for highly repetitive indexes. Because traditional grammar generation (transform) algorithms incur high time and space costs on large, highly redundant inverted index collections, we propose a parallel algorithm for context-free grammar generation. We further propose a greedy dictionary pruning algorithm to reduce cache misses in query processing. We also implement encoding, list intersection, and WAND querying on the grammar index. Experimental results indicate that the parallel grammar generation algorithm achieves a super-linear speedup with minor data overhead compared to the single-threaded algorithm. For example, with 10 threads processing the data set, a speedup of about 75 times is obtained with only $$4.3\%$$ data overhead. Moreover, parallel grammar generation has a negligible impact on query processing efficiency.
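
To make the idea of grammar-based compression of docID lists concrete, the following is a minimal single-threaded sketch. The abstract does not state which grammar construction is used, so the sketch assumes a Re-Pair-style scheme over d-gaps: the most frequent adjacent pair of symbols across all lists is repeatedly replaced by a new nonterminal. All function names (`to_gaps`, `build_grammar`, `expand`, and so on) are hypothetical, and representing nonterminals as negative integers is an illustrative choice, not the paper's encoding.

```python
from collections import Counter


def to_gaps(doc_ids):
    """Convert a strictly increasing docID list into d-gaps."""
    prev, gaps = 0, []
    for d in doc_ids:
        gaps.append(d - prev)
        prev = d
    return gaps


def from_gaps(gaps):
    """Inverse of to_gaps: prefix sums restore the docIDs."""
    total, doc_ids = 0, []
    for g in gaps:
        total += g
        doc_ids.append(total)
    return doc_ids


def replace_pair(seq, pair, symbol):
    """Replace non-overlapping occurrences of pair, scanning left to right."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(symbol)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def build_grammar(gap_lists):
    """Assumed Re-Pair-style construction shared by all lists.

    Terminals are non-negative gaps; nonterminals are negative integers.
    Pair counts are naive (overlapping occurrences are counted).
    """
    seqs = [list(s) for s in gap_lists]
    rules = {}          # nonterminal -> (left symbol, right symbol)
    next_symbol = -1
    while True:
        counts = Counter()
        for s in seqs:
            counts.update(zip(s, s[1:]))
        if not counts:
            break
        pair, freq = counts.most_common(1)[0]
        if freq < 2:    # no pair repeats, so a new rule would not pay off
            break
        rules[next_symbol] = pair
        seqs = [replace_pair(s, pair, next_symbol) for s in seqs]
        next_symbol -= 1
    return seqs, rules


def expand(seq, rules):
    """Expand nonterminals back into the original gap sequence."""
    out, stack = [], list(reversed(seq))
    while stack:
        sym = stack.pop()
        if sym in rules:
            left, right = rules[sym]
            stack.append(right)
            stack.append(left)
        else:
            out.append(sym)
    return out


if __name__ == "__main__":
    lists = [[3, 6, 9, 12, 20], [5, 8, 11, 14, 30]]
    compressed, rules = build_grammar([to_gaps(l) for l in lists])
    restored = [from_gaps(expand(s, rules)) for s in compressed]
    print(compressed, rules, restored == lists)
```

Recomputing all pair counts on every iteration is what makes a naive construction like this expensive on large collections; the sketch does not attempt the parallel generation or dictionary pruning that the abstract proposes.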