Abstract: Many hashing algorithms including minwise hashing (MinHash), one permutation hashing (OPH), and consistent weighted sampling (CWS) generate integers of B bits. With k hashes for each data vector, the storage would be B × k bits; and when used for large-scale learning, the model size would be 2^B × k, which can be expensive. A standard strategy is to use only the lowest b bits out of the B bits and somewhat increase k, the number of hashes. In this study, we propose to re-use the hashes by partitioning the B bits into m chunks, e.g., b × m = B. Correspondingly, the model size becomes m × 2^b × k, which can be substantially smaller than 2^B × k. The proposed "partitioned b-bit hashing" (Pb-Hash) is desirable for various reasons: (1) Generating hashes can be expensive for industrial-scale (user-facing) systems. Thus, engineers may hope to make use of each hash as much as possible, instead of generating more hashes (i.e., increasing k). (2) To protect user privacy, the hashes might be artificially "polluted" and the differential privacy (DP) budget is proportional to k. (3) After hashing, the original data are not necessarily stored and hence it might not even be possible to generate more hashes. (4) For advertising and recommendation, engineers can also apply Pb-Hash to large categorical (ID) features. Our theoretical analysis reveals that by partitioning the hash values into m chunks, the accuracy would drop. In other words, using m chunks of B/m bits would not be as accurate as directly using B bits. This is due to the correlation from re-using the same hash. On the other hand, our analysis also shows that the accuracy would not drop much for small m (e.g., m = 2 ∼ 4). In some regimes, Pb-Hash still works well even for m much larger than 4. We expect Pb-Hash would be a good addition to the family of hashing methods/applications and benefit industrial practitioners. Finally, we verify the effectiveness of Pb-Hash for linear SVM models as well as deep learning models.
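To make the partitioning concrete, below is a minimal sketch, not taken from the paper: the function name pb_hash_features, the chunk ordering (lowest bits first), and the block layout are illustrative assumptions. It shows how each B-bit hash could be split into m chunks of b = B/m bits and expanded into sparse one-hot features of total dimension k × m × 2^b (instead of k × 2^B), e.g., as input to a linear SVM.

```python
def pb_hash_features(hash_values, B=32, m=4):
    """Split each B-bit hash into m chunks of b = B/m bits and map every
    chunk to the index of its nonzero entry in a one-hot block of size 2**b.

    hash_values : list of k integer hashes for one data vector
                  (e.g., from MinHash / OPH / CWS).
    Returns the nonzero indices of the expanded feature vector, whose total
    dimension is k * m * 2**b instead of k * 2**B.
    """
    assert B % m == 0, "this sketch assumes B is divisible by m"
    b = B // m
    mask = (1 << b) - 1                          # selects the lowest b bits
    indices = []
    for j, h in enumerate(hash_values):          # j-th hash of the data vector
        for c in range(m):                       # c-th b-bit chunk of that hash
            chunk = (h >> (c * b)) & mask
            offset = (j * m + c) * (1 << b)      # block offset for (hash j, chunk c)
            indices.append(offset + chunk)
    return indices

# Toy usage: k = 3 hashes of B = 32 bits, partitioned into m = 4 chunks of b = 8 bits,
# giving a 3 * 4 * 256 = 3072-dimensional sparse one-hot representation.
print(pb_hash_features([0x1A2B3C4D, 0xDEADBEEF, 0x0F0F0F0F], B=32, m=4))
```

Setting m = 1 in this sketch recovers the standard practice of using only the lowest b bits of each hash, while larger m re-uses the remaining bits of the same hash at the cost of the correlation discussed above.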