Abstract: Large Language Models (LLMs) can be used to produce synthetic documents that mimic real documents when the latter are not available due to confidentiality or copyright restrictions. Herein, we investigate potential privacy breaches in automatically generated documents. We use synthetic texts generated by a pre-trained model fine-tuned on French clinical cases to evaluate potential privacy breaches along three directions: (1) similarity between the real training corpus and the synthetic corpus, (2) strong correlations between clinical features in the training and synthetic corpora, and (3) a Membership Inference Attack (MIA) using a model fine-tuned on the synthetic corpus. We identify clinical feature associations that suggest strategies for filtering the training corpus, which could contribute to privacy preservation. The membership inference attacks were not conclusive.