Abstract: Min-Hash is a popular technique for efficiently estimating the Jaccard similarity of binary sets. Consistent Weighted Sampling (CWS) generalizes the Min-Hash scheme to sketch weighted sets and has drawn increasing interest from the community. Due to its constant-time complexity independent of the values of the weights, Improved CWS (ICWS) is considered as the state-of-the-art CWS algorithm. In this paper, we revisit ICWS and analyze its underlying mechanism to show that there actually exists dependence between the two components of the hash-code produced by ICWS, which violates the condition of independence. To remedy the problem, we propose an Improved ICWS (I <sup xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">2</sup> CWS) algorithm which not only shares the same theoretical computational complexity as ICWS but also abides by the required conditions of the CWS scheme. The experimental results on a number of synthetic data sets and real-world text data sets demonstrate that our I <sup xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">2</sup> CWS algorithm can estimate the Jaccard similarity more accurately, and also competes with or outperforms the compared methods, including ICWS, in classification and top-K retrieval, after relieving the underlying dependence.
0 Replies
Loading