Publicly Available Clinical BERT Embeddings

07 May 2020 · OpenReview Archive Direct Upload
Abstract: Contextual word embedding models such as ELMo (Peters et al., 2018) and BERT (Devlin et al., 2018) have dramatically improved performance for many natural language processing (NLP) tasks in recent months. However, these models have been minimally explored on specialty corpora, such as clinical text; moreover, in the clinical domain, no publicly-available pre-trained BERT models yet exist. In this work, we address this need by exploring and releasing BERT models for clinical text: one for generic clinical text and another for discharge summaries specifically. We demonstrate that using a domain-specific model yields performance improvements on three common clinical NLP tasks as compared to nonspecific embeddings. These domain-specific models are not as performant on two clinical de-identification tasks, and we argue that this is a natural consequence of the differences between de-identified source text and synthetically non de-identified task text.
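Because the released models are standard BERT checkpoints, they can be used as drop-in replacements for generic BERT embeddings in downstream clinical NLP pipelines. Below is a minimal sketch in Python using the Hugging Face transformers library; the model identifier emilyalsentzer/Bio_ClinicalBERT is an assumption about where the weights are hosted, since the abstract itself does not name a distribution location.

```python
# Minimal sketch: load a released clinical BERT checkpoint and produce
# contextual token embeddings for a clinical sentence.
# Assumption: the weights are available on the Hugging Face Hub under
# "emilyalsentzer/Bio_ClinicalBERT"; verify the identifier before use.
import torch
from transformers import AutoTokenizer, AutoModel

model_name = "emilyalsentzer/Bio_ClinicalBERT"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

# Encode a short clinical sentence; the last hidden state holds one
# contextual embedding per wordpiece token.
inputs = tokenizer("Patient was discharged home on warfarin.",
                   return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

print(outputs.last_hidden_state.shape)  # (batch, seq_len, hidden_size)
```

These token-level embeddings (or the pooled [CLS] representation) can then be fed to a task-specific head, which is how domain-specific gains on tasks like clinical NER are typically measured against nonspecific embeddings.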