Detecting Unassimilated Borrowings in Spanish: An Annotated Corpus and Approaches to Modeling


16 Nov 2021 (modified: 05 May 2023) | ACL ARR 2021 November Blind Submission | Readers: Everyone
Abstract: This work presents a new resource for borrowing identification and analyzes the performance and errors of several models on this task. We introduce a new annotated corpus of Spanish newswire rich in unassimilated lexical borrowings (words from one language that are introduced into another without orthographic adaptation) and use it to evaluate how several sequence labeling models (CRF, BiLSTM-CRF, and Transformer-based models) perform. The corpus contains 370,000 tokens and is larger, more borrowing-dense, OOV-rich, and topic-varied than previous corpora available for this task. Our results show that a BiLSTM-CRF model fed with either Transformer-based embeddings pretrained on codeswitched data or a combination of contextualized word embeddings (along with character embeddings and Spanish and English subword embeddings) outperforms a multilingual BERT-based model.
