GovAIEc: A Lexical Complexity Corpus for Spanish in Ecuadorian public documents

ACL ARR 2024 June Submission 4010 Authors

16 Jun 2024 (modified: 04 Jul 2024), ACL ARR 2024 June Submission, CC BY 4.0
Abstract: We present GovAIEc, a new corpus of institutional texts in Ecuadorian Spanish annotated for lexical complexity, and describe how it was compiled and annotated. To ensure a comprehensive evaluation, complex words were labeled by a group of annotators with different levels of literacy. To provide the scientific community with a resource for advancing research on Lexical Simplification in Spanish, we carried out several complex word prediction experiments on the corpus. We use Lexical Complexity metrics as units of analysis and evaluate the corpus with pretrained language models such as XLM-RoBERTa-Base, RoBERTa-large-BNE, XLM-RoBERTa-Large and BERT. The corpus helps identify words that act as barriers to reading comprehension for users dealing with the bureaucratic procedures of various entities in Ecuador.
Paper Type: Long
Research Area: Machine Learning for NLP
Research Area Keywords: corpus, complex words, LLMs, Spanish
Contribution Types: NLP engineering experiment, Data resources
Languages Studied: English, Spanish
Submission Number: 4010