Analyzing the Leakage of Personal Information in Synthetic Clinical Spanish Texts

Published: 06 Oct 2024 · Last Modified: 12 Nov 2024 · WiNLP 2024 · CC BY 4.0
Keywords: Synthetic Text Generation; Clinical NLP; LLM Memorization; Privacy Preserving; Differential Privacy; Differentially Private Stochastic Gradient Descent; Spanish NLP
TL;DR: Ensuring true privacy in synthetic clinical text generation is crucial: by addressing memorization risks in LLMs, we can protect sensitive data while unlocking the potential of healthcare innovation.
Abstract: Using medical data for deep learning models can be highly beneficial, but protecting sensitive and personal patient information in the clinical field is critical. One of the most common ways to use this data while protecting patient privacy is to generate synthetic text with Large Language Models (LLMs) trained under differential privacy (DP). Although DP techniques such as Differentially Private Stochastic Gradient Descent (DP-SGD) are often assumed to guarantee privacy, they require specific conditions to be met. This study shows how memorization can occur in LLMs when these conditions are not met, compromising the privacy guarantees and potentially leading to the leakage of personal and sensitive information in generated clinical reports. If these gaps are addressed, DP could offer more reliable safeguards for clinical data, improving privacy without sacrificing utility.
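The "specific conditions" the abstract refers to are the two per-step requirements of the standard DP-SGD recipe (Abadi et al., 2016): gradients must be clipped per example, and Gaussian noise must be calibrated to the clipping bound. The following is a minimal illustrative PyTorch sketch of one such step, not the paper's implementation; `model`, `loss_fn`, and the hyperparameters `C` and `sigma` are placeholder names and values.

```python
# Minimal sketch of one DP-SGD step (Abadi et al., 2016); illustrative only,
# not the paper's code. Assumes batch_x / batch_y index into per-example tensors.
import torch

def dp_sgd_step(model, loss_fn, batch_x, batch_y, lr=0.01, C=1.0, sigma=1.0):
    params = [p for p in model.parameters() if p.requires_grad]
    clipped_sum = [torch.zeros_like(p) for p in params]

    # Condition 1: clip each example's gradient individually to norm <= C.
    # Clipping only the batch-averaged gradient voids the DP guarantee.
    for x, y in zip(batch_x, batch_y):
        model.zero_grad()
        loss_fn(model(x.unsqueeze(0)), y.unsqueeze(0)).backward()
        grad_norm = torch.sqrt(sum(p.grad.pow(2).sum() for p in params))
        scale = min(1.0, C / (grad_norm.item() + 1e-12))
        for acc, p in zip(clipped_sum, params):
            acc += p.grad * scale

    # Condition 2: add Gaussian noise scaled by sigma * C before averaging;
    # undersized noise silently weakens the nominal epsilon guarantee.
    n = len(batch_x)
    with torch.no_grad():
        for acc, p in zip(clipped_sum, params):
            noisy_grad = (acc + torch.randn_like(acc) * sigma * C) / n
            p -= lr * noisy_grad
```

The study's finding follows from this structure: when either step is skipped or misconfigured (for example, per-batch rather than per-example clipping), the stated privacy budget no longer bounds memorization, and generated clinical text can leak training data.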
Submission Number: 51