Establishing clinical NLP modelling recommendations for restricted data availability settings


16 Dec 2023, ACL ARR 2023 December Blind Submission
TL;DR: Recommendations for clinical NLP modelling in restricted data availability settings.
Abstract: When solving clinical Natural Language Processing (NLP) downstream tasks, it is well established that incorporating clinical-specific knowledge improves model performance. However, there are scenarios where access to data or domain-specific models is not feasible. Several paradigms exist for adapting NLP models to the clinical domain, such as fine-tuning an already pre-trained language model, pre-training and then fine-tuning a model, or in-context learning, but the advantages of each alternative with respect to data availability have yet to be explored. We determined the impact of data availability and paradigm selection on model performance across multiple clinical NLP tasks in Spanish by simulating multiple clinical data availability settings and testing various NLP modelling paradigms. Overall, the best-performing modelling strategy was to continue pre-training a masked language model (LM) on environment-specific unannotated text, starting from an off-the-shelf clinical checkpoint, and then fine-tune the LM for the downstream task. However, the performance gain from continued pre-training of an off-the-shelf LM is marginal relative to the large amount of resources that pre-training requires; therefore, we recommend fine-tuning an off-the-shelf clinical-specific LM if the model and task-specific data are available. If no data is available, we recommend a few-shot learning technique using a large LM.
Paper Type: long
Research Area: Resources and Evaluation
Languages Studied: Spanish
