LegalEc: A New Corpus for Complex Word Identification Research in Law Studies in Ecuatorian Spanish

Jenny Alexandra Ortiz Zambrano, César Espin-Riofrio, Arturo Montejo-Ráez

Published: 2023, Last Modified: 12 Mar 2024, Venue: Proces. del Leng. Natural 2023, Readers: Everyone

Abstract: In this paper, we present LegalEc, a new annotated corpus of complex words built from legal texts in Ecuadorian Spanish. We detail its compilation and annotation process. To give the scientific community a resource for continued research on Lexical Simplification in Spanish, we carried out several complex word prediction experiments on this corpus. We extracted 23 linguistic features and combined them with the encodings generated by models such as XLM-RoBERTa and RoBERTa-BNE (from the MarIA project). The evaluation shows that combining these features with the encodings improves the prediction of lexical complexity.
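
The abstract does not say how the linguistic features and the model encodings are combined. A minimal sketch, assuming simple vector concatenation, might look like the following. The three features shown are illustrative stand-ins for the 23 used in the paper, and `dummy_encoder` is a hypothetical placeholder for a contextual encoding from a model such as XLM-RoBERTa or RoBERTa-BNE.

```python
from typing import Callable, List, Sequence

# Spanish vowels, including accented forms.
VOWELS = set("aeiouáéíóúü")


def linguistic_features(word: str, sentence: str) -> List[float]:
    """Return three illustrative hand-crafted features for a target word.

    These are stand-ins chosen for this sketch, not the paper's 23 features.
    """
    tokens = sentence.lower().split()
    w = word.lower()
    return [
        float(len(w)),                                  # word length in characters
        float(sum(ch in VOWELS for ch in w)),           # number of vowels
        float(tokens.count(w)) / max(len(tokens), 1),   # relative frequency in the sentence
    ]


def dummy_encoder(word: str, sentence: str) -> Sequence[float]:
    """Hypothetical placeholder: a real system would return a transformer encoding."""
    return [0.0] * 4


def combine(
    word: str,
    sentence: str,
    encoder: Callable[[str, str], Sequence[float]],
) -> List[float]:
    """Concatenate hand-crafted features with the encoder output (assumed strategy)."""
    return linguistic_features(word, sentence) + list(encoder(word, sentence))


vector = combine("usufructo", "el usufructo se extingue", dummy_encoder)
```

The resulting vector would then be passed to a classifier or regressor that predicts lexical complexity. Which predictor the authors used is not stated in the abstract.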
