Half-Xor: A Fully-Dynamic Sketch for Estimating the Number of Distinct Values in Big Tables

Published: 01 Jan 2024, Last Modified: 10 Feb 2025IEEE Trans. Knowl. Data Eng. 2024EveryoneRevisionsBibTeXCC BY-SA 4.0
Abstract: Calculating the number of distinct values (i.e., NDV) in a column of a big table is costly yet fundamental to a variety of database applications such as data compression and profiling. To reduce the high time and space cost, a number of sketch methods (e.g., HyperLogLog) have been proposed, which estimate the NDV from a constructed compact data summary of distinct values. However, these methods fail or are costly to manage fully-dynamic scenarios where data is often inserted into and deleted from the table. To solve this issue, we propose a novel sketch method, Half-Xor . Our Half-Xor sketch consists of a compact bit matrix and a small counter array, and it needs to set a few bits and update a counter when handling a data insertion/deletion. Compared with the state-of-the-art mergeable method, our experimental results demonstrate that our method Half-Xor is up to 6.6 times more accurate under the same memory usage and reduces the memory usage by up to 16 times to achieve the same estimation accuracy.
Loading