Abstract: The inverted index is a fundamental tool in many data-intensive applications. Although numerous efforts have been made toward efficient inverted index-based query processing, existing schemes do not deliver the expected performance in modern data centers, where servers are equipped with powerful CPUs and relatively large memory. Through comprehensive measurement studies, we identify the root cause: the data formats used for index representation make it infeasible to design efficient query execution approaches on top of them, which results in poor parallel query support and wasted CPU computation. Driven by these findings, we propose to reorganize the in-memory index as columnar structures. To enable this idea, we construct the compact columnar format (i.e., Cocoa) that achieves desirable space efficiency while maintaining support for efficient searching. With Cocoa, we design an efficient query execution scheme that uses vectorized batch processing to avoid frequent branch mispredictions, as well as clause enumeration with pruning to save the overhead of materializing intermediate batches. We build an open-source system, VeloSearch, to embody our design; experimental results show that VeloSearch achieves ~30× better performance compared with state-of-the-art search libraries such as Lucene and Tantivy.
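To make the core idea concrete, below is a minimal sketch (not the actual Cocoa format or VeloSearch code) of what a columnar posting-list representation and batch-at-a-time conjunction might look like. The names `PostingList`, `and_batched`, and the batch size `BATCH` are illustrative assumptions, and Rust is chosen only for illustration; the sketch shows how processing doc IDs in fixed-size batches with a selection step replaces a per-document merge loop full of unpredictable branches.

```rust
// Hypothetical sketch, not VeloSearch's implementation: posting lists stored as
// plain columnar arrays of doc IDs, intersected batch-at-a-time.

const BATCH: usize = 1024; // assumed batch size; process doc IDs in fixed-size batches

/// A columnar posting list: just a sorted array of doc IDs.
struct PostingList {
    doc_ids: Vec<u32>,
}

/// Batched AND of two posting lists: for each batch of the shorter list,
/// probe the longer one, then materialize survivors in a second tight loop,
/// instead of interleaving comparison and output per document.
fn and_batched(a: &PostingList, b: &PostingList) -> Vec<u32> {
    let (short, long) = if a.doc_ids.len() <= b.doc_ids.len() { (a, b) } else { (b, a) };
    let mut out = Vec::with_capacity(short.doc_ids.len());
    for chunk in short.doc_ids.chunks(BATCH) {
        // Selection vector: mark which doc IDs in the batch also occur in `long`.
        let hits: Vec<bool> = chunk
            .iter()
            .map(|&d| long.doc_ids.binary_search(&d).is_ok())
            .collect();
        // Materialize the surviving doc IDs for this batch.
        out.extend(chunk.iter().zip(&hits).filter(|(_, &h)| h).map(|(&d, _)| d));
    }
    out
}

fn main() {
    let t1 = PostingList { doc_ids: (0..10_000).step_by(2).collect() };
    let t2 = PostingList { doc_ids: (0..10_000).step_by(3).collect() };
    let both = and_batched(&t1, &t2); // docs containing both terms
    println!("matches: {}", both.len()); // multiples of 6 below 10_000
}
```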
External IDs: dblp:conf/icde/ZhaoZHQ25