A Compact and Accurate Sketch for Estimating a Large Range of Set Difference Cardinalities

Published: 01 Jan 2024, Last Modified: 22 Nov 2024, ICDE 2024, CC BY-SA 4.0
Abstract: Computing set difference cardinalities is a critical task in database optimization, network management, and anomaly detection. Because computational and memory resources are limited, computing set difference cardinalities exactly is impractical in real-world applications. To address this issue, sketch methods such as the Odd sketch, the Tug-of-War sketch, and the HyperLogLog sketch can be extended to estimate set difference cardinalities approximately. These methods use a family of hash functions to compress all elements of a set into a compact data structure. Unfortunately, the Odd sketch suffers from a limited estimation range, while the Tug-of-War and HyperLogLog sketches incur large estimation errors and high computational costs. In this paper, we design GXBits, a novel bit-array data structure that quickly and accurately estimates set difference cardinalities over a large range. In GXBits, the probability that a bit records its corresponding elements follows a variant of the geometric distribution and varies across bits. We conduct extensive experiments on synthetic and real-world datasets. Experimental results demonstrate that GXBits is more computationally and memory efficient than existing methods and improves on their estimation accuracy by up to 221.3 times.
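For readers unfamiliar with the baselines, the following is a minimal Python sketch of the Odd sketch idea mentioned in the abstract, used here to estimate the size of the symmetric difference of two sets. It is not GXBits, whose construction the abstract does not specify. The function names, the choice of BLAKE2b as the hash, and the sketch size `m` are assumptions made for illustration only.

```python
import hashlib
import math


def odd_sketch(items, m):
    """Build a parity bit array: each item flips the single bit its hash selects."""
    bits = [0] * m
    for item in items:
        # Illustrative hash choice (assumption): 64-bit BLAKE2b digest of the item's string form.
        digest = hashlib.blake2b(str(item).encode("utf-8"), digest_size=8).digest()
        bits[int.from_bytes(digest, "big") % m] ^= 1
    return bits


def estimate_difference(sketch_a, sketch_b):
    """Estimate the symmetric-difference size from two sketches of equal length.

    Shared elements flip the same bit in both sketches and cancel under XOR,
    so the XOR behaves like a sketch of the symmetric difference alone.
    Returns None when at least half the bits are set, where the estimator is undefined.
    """
    m = len(sketch_a)
    z = sum(a ^ b for a, b in zip(sketch_a, sketch_b))
    if 2 * z >= m:
        return None  # saturated: the difference is too large for this sketch size
    return math.log(1 - 2 * z / m) / math.log(1 - 2 / m)


if __name__ == "__main__":
    a = odd_sketch(range(0, 10_000), 4096)
    b = odd_sketch(range(500, 10_500), 4096)
    # The true symmetric difference has 1000 elements.
    print(estimate_difference(a, b))
```

The `None` branch illustrates the limited estimation range attributed to the Odd sketch above: once the difference is large relative to `m`, about half of the XORed bits are set and the estimator stops carrying usable information.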