Keywords: hash, data compression, privacy
Abstract: Minwise hashing (MinHash) is a standard hashing algorithm for large-scale search and learning with the binary Jaccard similarity. One permutation hashing (OPH) is an effective and efficient alternative of MinHash which splits the data into K bins and generates hash values within each bin. In this paper, to protect the privacy of the output sketches, we combine differential privacy (DP) with OPH, and propose DP-OPH framework with three variants: DP-OPH-fix, DP-OPH-re and DP-OPH-rand, depending on the densification strategy to deal with empty bins in OPH. Detailed algorithm design and privacy and utility analysis are provided. The proposed DP-OPH methods significantly improves the DP minwise hashing (DP-MH) alternative in the literature. Experiments on similarity search confirm the effectiveness of our proposed algorithms. We also provide an extension to real-value data, named DP-BCWS, in the appendix.
Primary Area: probabilistic methods (Bayesian methods, variational inference, sampling, UQ, etc.)
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 13876
Loading