Abstract: Accurate human-annotated data for real-worlduse cases can be scarce and expensive to obtain. In the clinical domain, obtaining such data is evenmore difficult due to privacy concerns which notonly restrict open access to quality data but also require that the annotation be done by domain experts. In this paper, we propose a novel framework - InterDAPT - that leverages Intermediate Domain Finetuning to allow language models to adapt to narrow domains with small, noisy datasets. By making use of peripherally-related, unlabeled datasets,this framework circumvents domain-specific datascarcity issues. Our results show that this weaklysupervised framework provides performance improvements in downstream clinical named entityrecognition tasks.
Loading