Language Models Can Guess Your Identities from De-identified Clinical Notes

Anonymous Authors (Preprint Submission 577)

Published: 17 Apr 2024 · Readers: Everyone · License: CC BY-NC-ND 4.0
Keywords: de-identification, re-identification, clinical notes, EHR, privacy, language model, HIPAA, NER
TL;DR: Even with de-identified clinical notes, non-trivial risks of re-identifying patients remain both theoretically (via backdoor paths in causal graphs) and experimentally (three in a thousand), posing challenges to current data sharing practices.
Abstract: Although open data accelerates research by promoting reproducibility and benchmarking, machine learning for healthcare has limited access to open clinical notes due to concerns about patient privacy. The Health Insurance Portability and Accountability Act (HIPAA) of 1996 allows disclosing "de-identified health information" via the Safe Harbor provision, which requires removing 18 types of attributes and ensuring that the individual cannot be re-identified. A conventional de-identification approach detects tokens deemed relevant to HIPAA-protected attributes and removes or replaces them. Since this is time-consuming, an automated named entity recognition (NER) approach is often employed. Re-identification, however, is still possible because contextual identifiers (e.g., "homeless") are not protected attributes. Here we formalized the de-identification problem using causal graphs and showed how NER-based de-identification fails to remove dependencies between de-identified notes and protected attributes. Empirically, we de-identified proprietary clinical notes using an NER-based de-identifier and finetuned a public BERT model to predict demographic attributes from the de-identified notes. We showed that the finetuned model can recover patients' sex, neighborhood, visit year, visit month, income, and insurance provider with above-random accuracy using just 1000 training examples. These attributes can be further used to re-identify patients. For example, a hacker can filter people in an external database (e.g., voter registration records) using the predicted attributes. Here, we used the original patient database as an ideal external database (due to legal constraints on using the voter database). We showed that the finetuned model has better-than-random accuracy in re-identifying patients in a group. To assess individual-level risk, or the probability of being re-identified as an individual, we assumed the hacker guesses the patient by uniformly drawing one person from the re-identified group. The risk is around three in a thousand using the fully finetuned model and around three hundred and eighty in a million using the model finetuned with just 1000 examples. Sharing information as innocuous as a patient's medical diagnosis also enables better-than-random prediction of their neighborhood, showing that identity leaks may come from all parts of a note. This suggests that a compromise between privacy and utility is unavoidable because even essential information, such as diagnoses, may be used to leak patient identities. We discuss how to find the right compromise and encourage researchers, the government, and the healthcare industry to participate in this dialogue.
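As a rough illustration (not the authors' code), the minimal Python sketch below shows the linkage step the abstract describes: attributes predicted from a de-identified note are used to filter an external database, and the individual-level risk under a uniform guess within the matched group is 1/(group size) when the true patient is in that group, and 0 otherwise. All names (e.g., predict_attributes, Record) and the toy records are hypothetical; the paper's actual predictor is a finetuned BERT model on proprietary notes.

```python
# Hypothetical sketch of the linkage-style re-identification attack
# outlined in the abstract. Names, records, and the stub predictor are
# illustrative only.

from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Record:
    person_id: str
    sex: str
    neighborhood: str
    visit_year: int


def predict_attributes(deidentified_note: str) -> Dict[str, object]:
    # Stand-in for the finetuned classifier; in the paper this role is
    # played by a BERT model predicting sex, neighborhood, visit year, etc.
    return {"sex": "F", "neighborhood": "Ward 3", "visit_year": 2015}


def link(predicted: Dict[str, object], external_db: List[Record]) -> List[Record]:
    # Filter the external database (e.g., voter records) to people who
    # match every predicted attribute.
    return [
        r for r in external_db
        if r.sex == predicted["sex"]
        and r.neighborhood == predicted["neighborhood"]
        and r.visit_year == predicted["visit_year"]
    ]


def individual_risk(true_id: str, matched: List[Record]) -> float:
    # Probability of re-identification if the attacker guesses uniformly
    # at random within the matched group; 0 if the true patient is absent.
    if not any(r.person_id == true_id for r in matched):
        return 0.0
    return 1.0 / len(matched)


if __name__ == "__main__":
    db = [
        Record("p1", "F", "Ward 3", 2015),
        Record("p2", "F", "Ward 3", 2015),
        Record("p3", "M", "Ward 5", 2016),
    ]
    group = link(predict_attributes("de-identified note text"), db)
    print(individual_risk("p1", group))  # 0.5 in this toy example
```

Averaging individual_risk over many notes would give a population-level estimate analogous to the "three in a thousand" figure reported above, though the paper's exact evaluation protocol may differ from this sketch.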
Submission Number: 577