Track: Short Paper
Abstract: Leveraging medical data for Deep Learning models holds great potential, but protecting sensitive patient information is paramount in the clinical domain. A widely used approach to balancing data utility and privacy is the generation of synthetic text with Large Language Models (LLMs) under the framework of differential privacy (DP). Techniques like Differentially Private Stochastic Gradient Descent (DP-SGD) are typically considered to provide privacy guarantees, but these guarantees rely on specific conditions. This research demonstrates how memorization in LLMs can undermine those guarantees when the conditions are not fully met, increasing the risk of personal and sensitive information leaking into synthetic clinical reports. Addressing these vulnerabilities could enhance the reliability of DP in protecting clinical text data while maintaining its utility.
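As context for the abstract's point that DP-SGD's guarantee "relies on specific conditions", here is a minimal, illustrative PyTorch sketch of the two operations that condition rests on: per-example gradient clipping (bounding each record's sensitivity) and Gaussian noise calibrated to that bound. The toy linear model, random data, and hyperparameter values are assumptions for illustration only, not the paper's setup.

```python
import torch
import torch.nn as nn

# Hypothetical toy setup: a linear classifier on random data stands in
# for an LLM fine-tuned on clinical text (names/sizes are illustrative).
torch.manual_seed(0)
model = nn.Linear(16, 2)
loss_fn = nn.CrossEntropyLoss()
X, y = torch.randn(32, 16), torch.randint(0, 2, (32,))

clip_norm = 1.0        # C: per-example gradient norm bound
noise_multiplier = 1.0 # sigma: noise scale relative to C
lr = 0.1

for step in range(3):
    # Accumulate clipped per-example gradients: the bounded-sensitivity
    # condition DP-SGD's privacy guarantee depends on.
    grads = [torch.zeros_like(p) for p in model.parameters()]
    for xi, yi in zip(X, y):
        model.zero_grad()
        loss_fn(model(xi.unsqueeze(0)), yi.unsqueeze(0)).backward()
        norm = torch.sqrt(sum(p.grad.pow(2).sum() for p in model.parameters()))
        scale = (clip_norm / (norm + 1e-6)).clamp(max=1.0)
        for g, p in zip(grads, model.parameters()):
            g += p.grad * scale
    # Add Gaussian noise calibrated to the clipping bound, average, and step.
    with torch.no_grad():
        for g, p in zip(grads, model.parameters()):
            g += torch.randn_like(g) * noise_multiplier * clip_norm
            p -= lr * g / len(X)
```

If either operation is weakened in practice (e.g., clipping skipped for efficiency, or noise scaled incorrectly for the actual sampling scheme), the formal DP accounting no longer applies, which is the kind of unmet condition under which memorized training records can surface in generated text.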
Submission Number: 63