Abstract: Data similarity computation is a fundamental research topic that underpins many high-level applications built on similarity measures. However, exact similarity computation has become prohibitively expensive in large-scale real-world scenarios. MinHash is a popular technique for efficiently estimating the Jaccard similarity of binary sets, and weighted MinHash extends it to estimate the generalized Jaccard similarity of weighted sets. This review categorizes and discusses the existing weighted MinHash algorithms. We have also developed a Python toolbox for these algorithms and released it on GitHub.
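
To make the two quantities named above concrete, the following is a minimal illustrative sketch and not the API of the toolbox mentioned in the abstract. It computes the exact Jaccard similarity of binary sets, the exact generalized Jaccard similarity of weighted sets (sum of element-wise minima over sum of element-wise maxima), and a plain MinHash estimate of the former. All function names are hypothetical, and the salted BLAKE2b hash family is an assumption made here for reproducibility; weighted MinHash algorithms themselves are not implemented.

```python
import hashlib


def jaccard(a, b):
    """Exact Jaccard similarity |A & B| / |A | B| of two binary sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def generalized_jaccard(x, y):
    """Exact generalized Jaccard similarity of two weighted sets.

    x and y map elements to non-negative weights; a missing element
    has weight 0. The result is sum(min) / sum(max) over all elements.
    """
    keys = set(x) | set(y)
    num = sum(min(x.get(k, 0.0), y.get(k, 0.0)) for k in keys)
    den = sum(max(x.get(k, 0.0), y.get(k, 0.0)) for k in keys)
    return num / den if den else 1.0


def _salted_hash(item, index):
    """Hypothetical hash family: the index selects one 64-bit hash function."""
    salt = index.to_bytes(8, "little")
    data = repr(item).encode("utf-8")
    digest = hashlib.blake2b(data, digest_size=8, salt=salt).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(items, num_hashes=256):
    """Keep, for each hash function, the minimum hash value over the set."""
    items = set(items)
    if not items:
        raise ValueError("MinHash signature is undefined for an empty set")
    return [min(_salted_hash(item, i) for item in items)
            for i in range(num_hashes)]


def estimate_jaccard(sig_a, sig_b):
    """Fraction of positions where the two signatures agree."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    matches = sum(1 for u, v in zip(sig_a, sig_b) if u == v)
    return matches / len(sig_a)


if __name__ == "__main__":
    a = set(range(0, 100))
    b = set(range(50, 150))
    print("exact Jaccard:    ", jaccard(a, b))
    print("MinHash estimate: ",
          estimate_jaccard(minhash_signature(a), minhash_signature(b)))
    print("generalized Jaccard:",
          generalized_jaccard({"a": 2, "b": 1}, {"a": 1, "c": 1}))
```

For a single hash function, the probability that two sets share the same minimum hash value equals their Jaccard similarity, so the fraction of matching signature positions is an unbiased estimate whose variance shrinks as `num_hashes` grows. Comparing two signatures costs time proportional to `num_hashes` rather than to the set sizes, which is the source of the efficiency gain at scale.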