Revisiting De-Identification of Electronic Medical Records: Evaluation of Within- and Cross-Hospital Generalization

Published: 07 Oct 2023, Last Modified: 01 Dec 2023EMNLP 2023 MainEveryoneRevisionsBibTeX
Submission Type: Regular Short Paper
Submission Track: Resources and Evaluation
Submission Track 2: Machine Learning for NLP
Keywords: De-Identification, Electronic Medical Records, Domain Generalization
Abstract: The de-identification task aims to detect and remove the protected health information from electronic medical records (EMRs). Previous studies generally focus on the within-hospital setting and achieve great successes, while the cross-hospital setting has been overlooked. This study introduces a new de-identification dataset comprising EMRs from three hospitals in China, creating a benchmark for evaluating both within- and cross-hospital generalization. We find significant domain discrepancy between hospitals. A model with almost perfect within-hospital performance struggles when transferred across hospitals. Further experiments show that pretrained language models and some domain generalization methods can alleviate this problem. We believe that our data and findings will encourage investigations on the generalization of medical NLP models.
Submission Number: 2395
Loading