Abstract: Recent advances in natural language processing (NLP) have significantly raised expectations for generative models to produce coherent text across diverse language varieties. In the particular case of Portuguese, the predominance of Brazilian Portuguese corpora online leaves linguistic traces in these models, limiting their adoption outside Brazil. To address this gap and promote the creation of European Portuguese resources, we developed a cross-domain language variety identifier (LVI) to discriminate between European and Brazilian Portuguese. The findings of our literature review motivated us to compile PtBrVarId, a cross-domain LVI corpus, and to study how transformer-based LVI classifiers can be optimised to perform in a cross-domain scenario. Our most effective model, a version of BERT fine-tuned on PtBrVarId, sets a new state-of-the-art (SOTA) result of $0.84$ $F_1$-Score on the DSL-TL corpus, the LVI reference benchmark. This result was obtained while maintaining SOTA results (above $0.90$ $F_1$-Score) in the cross-domain scenario. Although this research focuses on two Portuguese varieties, its ideas can be extended to other varieties and languages. We open-source the code, corpus, and models to foster further research on this task.
Paper Type: long
Research Area: Resources and Evaluation
Contribution Types: Approaches to low-resource settings, Publicly available software and/or pre-trained models, Data resources
Languages Studied: Portuguese