Track: Long Paper Track (up to 9 pages)
Keywords: Privacy, NLP, Text, Reidentification, Data Release, Sanitization, Anonymization
TL;DR: A privacy evaluation framework that quantifies the disclosure risks of sanitized dataset releases beyond surface-level leakage, exposing a false sense of privacy
Abstract: Sanitizing sensitive text data for release often relies on methods that remove personally identifiable information (PII) or generate synthetic data.
However, evaluations of these methods have focused on measuring surface-level privacy leakage (e.g., revealing explicit identifiers like names).
We propose the first semantic privacy evaluation framework for sanitized textual datasets, leveraging re-identification attacks.
On medical records and chatbot dialogue datasets, we demonstrate that seemingly innocuous auxiliary information, such as a mention of specific speech patterns, can be used to deduce sensitive attributes like age or substance use history.
PII removal techniques perform only surface-level textual manipulations; for example, the industry-standard Azure PII removal tool fails to protect 89\% of the original information.
On the other hand, synthesizing data with differential privacy protects sensitive information but garbles the data, rendering it much less useful for downstream tasks.
Our findings reveal that current data sanitization methods create a \textit{false sense of privacy}, and underscore the urgent need for more robust methods that both protect privacy and preserve utility.
Submission Number: 85