Legal-ES: A Set of Large Scale Resources for Spanish Legal Text Processing

Doaa Samy, Jerónimo Arenas-García, David Pérez-Fernández

Published: 2020, Last Modified: 18 Feb 2025, Venue: LT4Gov@LREC 2020, License: CC BY-SA 4.0

Abstract: Legal-ES is an open-source resource kit for legal Spanish. It consists of a large-scale Spanish corpus of open legal texts and several kinds of language models, including word embeddings and topic models. The corpus contains over 1000 million words drawn from open-access legislative and administrative documents in Spanish, from sources representing international, national and regional entities. The corpus is pre-processed and tokenized using spaCy. The word embeddings were trained with gensim on the resulting tokens, producing a representation space that is especially suited to reflect the inherent characteristics of the legal domain. We also compute topic models, which provide a convenient tool for understanding the main topics in the corpus and for navigating the documents by exploiting the semantic similarity among them. We will analyse the time structure of a dynamic topic model to infer changes in the legal production of Spanish jurisdiction that have occurred over the analysed time frame.
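
To make the idea of navigating documents by semantic similarity more concrete, the following is a minimal sketch and not the authors' implementation. It assumes each document is represented by a vector of topic proportions and ranks documents by Hellinger distance. Both the choice of distance and the names used here (`hellinger`, `most_similar`, the document identifiers and their topic values) are hypothetical and do not come from the abstract.

```python
import math


def hellinger(p, q):
    """Hellinger distance between two discrete distributions (0 = identical, 1 = disjoint)."""
    if len(p) != len(q):
        raise ValueError("topic vectors must have the same length")
    squared = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(squared / 2.0)


def most_similar(query_id, doc_topics, top_n=3):
    """Return up to top_n (doc_id, distance) pairs, closest to the query first."""
    query = doc_topics[query_id]
    scored = [
        (doc_id, hellinger(query, vec))
        for doc_id, vec in doc_topics.items()
        if doc_id != query_id
    ]
    scored.sort(key=lambda pair: pair[1])
    return scored[:top_n]


# Hypothetical topic proportions for three documents over three topics.
# In practice these would be produced by a trained topic model.
doc_topics = {
    "doc_tax_law": [0.80, 0.15, 0.05],
    "doc_tax_regulation": [0.70, 0.20, 0.10],
    "doc_environment": [0.05, 0.10, 0.85],
}

ranking = most_similar("doc_tax_law", doc_topics)
for doc_id, distance in ranking:
    print(f"{doc_id}: {distance:.3f}")
```

Hellinger distance is used in this sketch because topic proportions are probability distributions. Cosine similarity over the same vectors would be a reasonable alternative.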