Abstract: The entity resolution problem requires finding pairs across datasets that belong to different owners but refer to the same entity in the real world. To train and evaluate solutions (either rule-based or machine-learning-based) to the entity resolution problem, generating a ground truth dataset with entity pairs or clusters is needed. However, such a data annotation process involves humans as domain oracles to review the plaintext data for all candidate record pairs from different parties, which inevitably infringes the privacy of data owners, especially in privacy-sensitive cases like medical records. To the best of our knowledge, there is no prior work on privacy-preserving ground truth labeling in the context of entity resolution. We propose a novel blind annotation protocol based on homomorphic encryption that allows domain oracles to collaboratively label ground truth without sharing data in plaintext with other parties. In addition, we design a domain-specific, user-friendly language that conceals the complex underlying homomorphic encryption circuits, making it more accessible and easier for users to adopt this technique. The empirical experiments indicate the feasibility of our privacy-preserving protocol (f-measure on average achieves more than 90\% compared with the real ground truth).
Submission Length: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Jonathan_Ullman1
Submission Number: 4513
Loading