Abstract: User-generated textual data from online activities, such as tweets and reviews, is widely used to train machine learning models. However, such text can be a source of privacy leakage for individuals' private attributes. In this paper, we study the privacy issues in user-generated text and propose a privacy-preserving text representation learning framework, $${DP}_{BERT}$$ . Our framework uses BERT to extract sentence embeddings and learns a textual representation that (1) is differentially private, protecting against identity leakage (e.g., whether a target instance is in the data or not), (2) protects against leakage of private-attribute information (e.g., age, gender, location), and (3) maintains the high utility of the given text.
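The abstract does not specify the exact privatization mechanism, but the general pattern it describes, making a sentence embedding differentially private, can be sketched as output perturbation with the Laplace mechanism. Everything below is an illustrative assumption: the function name `dp_perturb`, the clipping step, and the use of a random vector as a stand-in for a real BERT sentence embedding are not from the paper.

```python
import numpy as np

def dp_perturb(embedding, epsilon, sensitivity=1.0, rng=None):
    """Perturb an embedding with the Laplace mechanism: adding noise
    with scale sensitivity/epsilon yields epsilon-differential privacy
    for queries with the given L1 sensitivity (illustrative sketch)."""
    rng = np.random.default_rng() if rng is None else rng
    scale = sensitivity / epsilon
    noise = rng.laplace(loc=0.0, scale=scale, size=embedding.shape)
    return embedding + noise

# Stand-in for a BERT sentence embedding (BERT-base outputs 768 dims).
emb = np.random.default_rng(0).normal(size=768)
# Normalize so the L1 norm is bounded by 1, bounding the sensitivity.
emb = emb / max(1.0, np.abs(emb).sum())
# Smaller epsilon means more noise and stronger privacy.
private_emb = dp_perturb(emb, epsilon=1.0)
```

A downstream classifier would then be trained on `private_emb` instead of the raw embedding, trading some utility for the privacy guarantee.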