Abstract: Electronic health records (EHRs) contain patient-related information comprising structured and unstructured data, and are a valuable data source for Natural Language Processing (NLP) in the healthcare domain. Contextual word embeddings and Transformer-based models have proved their potential, reaching state-of-the-art performance on various NLP tasks. Although performance on downstream NLP tasks over free texts written in English has recently improved, fewer resources are available for clinical texts and low-resource languages such as Portuguese. Our objective is to develop a Generative Pre-trained Transformer 2 (GPT-2) language model for Portuguese to support clinical and biomedical NLP tasks. Using transfer learning, we fine-tuned a generic Portuguese GPT-2 model on corpora of biomedical texts written in Portuguese. We evaluated it on a public dataset manually annotated for detecting patient falls, i.e., a classification task. Our in-domain GPT-2 model outperformed the generic Portuguese GPT-2 model by 3.43 in weighted F1-score. Our preliminary results show that transfer learning with domain literature can benefit Portuguese biomedical NLP tasks, in line with results reported for other languages.
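The abstract does not specify the training setup, but the described step, fine-tuning a generic Portuguese GPT-2 on an in-domain biomedical corpus with a causal language modeling objective, can be sketched with the Hugging Face `transformers` and `datasets` libraries. In the sketch below, the checkpoint name `pierreguillou/gpt2-small-portuguese`, the corpus file `biomedical_pt.txt`, and all hyperparameters are illustrative assumptions, not details taken from the paper.

```python
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)
from datasets import load_dataset

# Assumed starting point: a generic Portuguese GPT-2 checkpoint from the
# Hugging Face Hub; the paper's exact base model is not named here.
checkpoint = "pierreguillou/gpt2-small-portuguese"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint)

# GPT-2 defines no pad token; reuse the end-of-text token for batching.
tokenizer.pad_token = tokenizer.eos_token

# Hypothetical in-domain corpus: one plain-text file of Portuguese
# biomedical text, tokenized into fixed-length sequences.
dataset = load_dataset("text", data_files={"train": "biomedical_pt.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset["train"].map(tokenize, batched=True, remove_columns=["text"])

# mlm=False selects the causal LM objective: the collator builds labels
# by shifting the inputs, which is how GPT-2 is trained.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

args = TrainingArguments(
    output_dir="gpt2-bio-pt",          # illustrative output directory
    num_train_epochs=3,                # assumed, not from the paper
    per_device_train_batch_size=4,
    save_steps=1000,
)

Trainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    data_collator=collator,
).train()
```

After this continued pre-training step, the in-domain model would be loaded as a classifier backbone (e.g., via `AutoModelForSequenceClassification`) and fine-tuned on the annotated patient-fall dataset for the downstream classification task the abstract reports.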