Vocabulary Modifications for Domain-adaptive Pretraining of Clinical Language Models

Anastasios Lamproudis, Aron Henriksson, Hercules Dalianis

Published: 01 Jan 2022, Last Modified: 19 Feb 2025. HEALTHINF 2022. CC BY-SA 4.0

Abstract: Research has shown that using generic language models – specifically, BERT models – in specialized domains may be sub-optimal due to domain differences in language use and vocabulary. There are several techniques for developing domain-specific language models that leverage existing generic language models, including continued and domain-adaptive pretraining with in-domain data. Here, we investigate a strategy based on using a domain-specific vocabulary, while leveraging a generic language model for initialization. The results demonstrate that domain-adaptive pretraining, in combination with a domain-specific vocabulary – as opposed to a general-domain vocabulary – yields improvements on two downstream clinical NLP tasks for Swedish. The results highlight the value of domain-adaptive pretraining when developing specialized language models and indicate that it is beneficial to adapt the vocabulary of the language model to the target domain prior to continued, domain-adaptive pretraining.
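
The abstract does not say how the generic model's parameters are carried over to the domain-specific vocabulary. As an illustration only, the sketch below shows one common way such an initialization could be done, and it should not be read as the authors' method. It assumes BERT-style WordPiece tokens with a "##" continuation prefix. Tokens shared by both vocabularies keep their generic embedding, tokens that exist only in the domain vocabulary get the mean of the generic sub-token embeddings they split into, and anything that cannot be split falls back to the mean of all generic embeddings. All function names and the toy vectors are hypothetical.

```python
from typing import Dict, List

Vector = List[float]


def wordpiece_split(token: str, vocab: Dict[str, Vector]) -> List[str]:
    """Greedy longest-match-first split of `token` into pieces present in `vocab`.

    Assumes BERT-style WordPiece: non-initial pieces carry a '##' prefix.
    Returns an empty list when no complete split exists.
    """
    is_continuation = token.startswith("##")
    text = token[2:] if is_continuation else token
    pieces: List[str] = []
    start = 0
    while start < len(text):
        end = len(text)
        match = None
        while end > start:
            piece = text[start:end]
            if start > 0 or is_continuation:
                piece = "##" + piece
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return []  # some span is not covered by the generic vocabulary
        pieces.append(match)
        start = end
    return pieces


def mean(vectors: List[Vector]) -> Vector:
    """Element-wise mean of equally sized vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def init_domain_embeddings(
    generic: Dict[str, Vector], domain_vocab: List[str]
) -> Dict[str, Vector]:
    """Build an embedding table for `domain_vocab` from generic embeddings."""
    fallback = mean(list(generic.values()))
    initialized: Dict[str, Vector] = {}
    for token in domain_vocab:
        if token in generic:
            # Shared token: reuse the generic embedding unchanged.
            initialized[token] = list(generic[token])
            continue
        pieces = wordpiece_split(token, generic)
        if pieces:
            # New token: average the generic sub-token embeddings.
            initialized[token] = mean([generic[p] for p in pieces])
        else:
            # No split found: use the mean of all generic embeddings.
            initialized[token] = list(fallback)
    return initialized


# Toy example with 2-dimensional embeddings (values are made up).
generic_embeddings = {
    "hyper": [1.0, 0.0],
    "##tension": [0.0, 1.0],
    "the": [2.0, 2.0],
}
domain_embeddings = init_domain_embeddings(
    generic_embeddings, ["the", "hypertension", "xyz"]
)
```

In a real setting only the input embedding matrix (and any tied output layer) would need this treatment, since the remaining Transformer weights do not depend on the vocabulary. The resulting model would then go through domain-adaptive pretraining on in-domain text.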