De-identification in natural language processing

Veronika Vincze, Richárd Farkas

Published: 2014, Last Modified: 19 Feb 2025. Venue: MIPRO 2014. License: CC BY-SA 4.0

Abstract: Natural language processing (NLP) systems usually require large amounts of textual data, but the publication of such datasets is often hindered by privacy and data protection issues. Here, we discuss de-identification issues in three NLP areas: clinical NLP, NLP for social media, and information extraction from resumes. We also illustrate how de-identification relates to named entity recognition, and we argue that de-identification tools can be successfully built on named entity recognizers.
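
The idea of building a de-identifier on top of an entity recognizer can be illustrated with a minimal sketch. It is not the authors' system: the `recognize` function is a hypothetical regex-based stand-in for a trained named entity recognizer, and the entity labels (`PERSON`, `EMAIL`, `PHONE`) and patterns are assumptions chosen only for illustration. The de-identification step itself simply replaces each recognized span with a placeholder for its entity type.

```python
import re
from typing import List, Tuple

# Hypothetical stand-in for a trained named entity recognizer.
# Each regex marks one assumed entity type; a real system would
# use a statistical or neural NER model instead.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "PHONE": re.compile(r"\+?\d[\d\s-]{7,}\d"),
    "PERSON": re.compile(
        r"\b(?:Dr|Mr|Ms|Mrs)\.\s+[A-Z][a-z]+(?:\s+[A-Z][a-z]+)?"
    ),
}

Span = Tuple[int, int, str]  # (start offset, end offset, entity label)


def recognize(text: str) -> List[Span]:
    """Return entity spans sorted by start offset."""
    spans: List[Span] = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            spans.append((match.start(), match.end(), label))
    return sorted(spans)


def deidentify(text: str) -> str:
    """Replace every recognized entity span with a [LABEL] placeholder."""
    pieces: List[str] = []
    cursor = 0
    for start, end, label in recognize(text):
        if start < cursor:
            # Skip a span that overlaps one already replaced.
            continue
        pieces.append(text[cursor:start])
        pieces.append(f"[{label}]")
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces)


if __name__ == "__main__":
    sample = "Dr. Jane Smith (jane.smith@example.org, +36 30 123 4567) was admitted."
    print(deidentify(sample))
```

In this design the recognizer and the replacement step are decoupled, so swapping the regex stand-in for a real named entity recognizer would leave `deidentify` unchanged as long as it still receives `(start, end, label)` spans.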