Sentence Embedding Generation Method for Differential Privacy Protection

Published: 2025 · Last Modified: 05 Jan 2026 · ACISP (3) 2025 · CC BY-SA 4.0
Abstract: General-purpose NLP language models, represented by BERT, map original text into dense vector representations, but these feature vectors often encode sensitive words from the original plaintext that should not be exposed, such as a user's identity or location information. Defense mechanisms based on differential privacy have been proposed to quantify and bound this privacy loss, but they still have limitations and shortcomings. This paper designs a more effective method for generating private sentence vectors that provides provable \(\epsilon\)-DP guarantees through a differential privacy noise mechanism while preserving the utility of the sentence vectors for downstream tasks. The method consists of two components: a sensitivity calculation method based on the MASK mechanism and clustering, and a sentence vector noising method that combines correlation and sensitivity. The former annotates sensitive words, finds a set of alternative words for each of them, and calculates the sensitivity. The latter formulates a privacy budget allocation rule based on the sensitivity and on each dimension's contribution to the model output, as computed by an explainability algorithm, and perturbs each dimension of the sentence vector according to this rule. Experimental results show that the method overcomes the shortcomings of existing schemes while achieving a good trade-off between utility and privacy.
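The abstract does not give the exact budget allocation or noise formulas, so the following is only a minimal sketch of the general idea: per-dimension sensitivity is estimated from embeddings of alternative-word variants of the sentence, the total budget \(\epsilon\) is split across dimensions in proportion to each dimension's (assumed) contribution score, and Laplace noise scaled by sensitivity over per-dimension budget is added. All function names, the proportional weighting rule, and the toy inputs are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def per_dim_sensitivity(original_vec, alternative_vecs):
    """Per-dimension sensitivity: the largest absolute change of each
    coordinate when a sensitive word is replaced by any alternative word.
    (Assumed definition; the paper derives this via MASK + clustering.)"""
    diffs = np.abs(alternative_vecs - original_vec)  # shape (n_alt, d)
    return diffs.max(axis=0)                          # shape (d,)

def allocate_budget(epsilon, contributions):
    """Split the total budget epsilon across dimensions in proportion to
    each dimension's contribution to the model output, so that important
    dimensions receive more budget and hence less noise (illustrative rule)."""
    w = np.abs(contributions)
    w = w / w.sum()
    return epsilon * w

def privatize(sentence_vec, sensitivities, contributions, epsilon, rng=None):
    """Add dimension-wise Laplace noise with scale b_i = sensitivity_i / eps_i."""
    rng = rng or np.random.default_rng()
    eps_per_dim = allocate_budget(epsilon, contributions)
    scale = sensitivities / np.maximum(eps_per_dim, 1e-12)
    return sentence_vec + rng.laplace(0.0, scale)

# Toy usage with random stand-ins for the real embeddings and scores.
d = 8
rng = np.random.default_rng(0)
vec = rng.normal(size=d)                          # original sentence embedding
alts = vec + rng.normal(scale=0.1, size=(5, d))   # embeddings with substitute words
contrib = rng.uniform(size=d)                      # e.g. from an explainability method
noisy = privatize(vec, per_dim_sensitivity(vec, alts), contrib, epsilon=1.0, rng=rng)
print(noisy)
```

Under basic sequential composition, the per-dimension budgets sum to the total \(\epsilon\), which is why the allocation step distributes, rather than reuses, the budget across dimensions.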