A Unified Framework for Processing Exact and Approximate Top-k Set Similarity Join

Cihai Sun, Hongya Wang, Yingyuan Xiao, Zhenyu Liu

Published: 2020, Last Modified: 05 Nov 2025. Venue: APWeb/WAIM (2) 2020. License: CC BY-SA 4.0

Abstract: We observe that only a few low-frequency tokens (far fewer than the prefix contains) are enough to find similar pairs when processing top-k set similarity joins. This phenomenon appears in all real datasets we have experimented with, covering domains as varied as text, social networks, and protein sequence data. Possible explanations are discussed. Based on this observation, we propose an algorithm called AEtop-k for processing both approximate and exact top-k similarity joins in a unified framework. Comprehensive experiments on a large collection of real-life datasets demonstrate that, compared with the state-of-the-art algorithm, the approximate version of our algorithm achieves up to 10000\(\times\) speedup with little sacrifice in accuracy, and the exact version runs up to 5\(\times\) faster than the existing algorithm.
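
The observation in the abstract can be illustrated with a minimal sketch. This is not the AEtop-k algorithm, whose details the abstract does not give; it is a simplified approximate top-k join that assumes Jaccard similarity and generates candidate pairs only from each set's few rarest tokens. The function names and the `num_tokens` parameter are hypothetical and chosen for illustration.

```python
from collections import Counter, defaultdict
from itertools import combinations
import heapq


def jaccard(a, b):
    """Jaccard similarity of two non-empty sets."""
    return len(a & b) / len(a | b)


def topk_join_rare_tokens(records, k, num_tokens):
    """Approximate top-k set similarity join (illustrative sketch only).

    Each record is indexed under its `num_tokens` rarest tokens; only
    records sharing one of those tokens are compared. Pairs that share
    no indexed token are missed, so the result is approximate.
    """
    sets = [frozenset(r) for r in records]

    # Global token frequencies: how many records contain each token.
    freq = Counter(tok for s in sets for tok in s)

    # Inverted index over each record's rarest tokens only.
    index = defaultdict(list)
    for rid, s in enumerate(sets):
        rare = sorted(s, key=lambda t: (freq[t], t))[:num_tokens]
        for tok in rare:
            index[tok].append(rid)

    # Candidate pairs are records that collide on an indexed token.
    candidates = set()
    for rids in index.values():
        candidates.update(combinations(rids, 2))

    # Verify candidates exactly and keep the k most similar pairs.
    scored = [(jaccard(sets[i], sets[j]), i, j) for i, j in candidates]
    return heapq.nlargest(k, scored)
```

Raising `num_tokens` trades speed for recall: more tokens per record produce more candidate pairs to verify, but fewer similar pairs are missed.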