A Dataset of Latin Etymologies Extracted from Wiktionary

Javier de Torres Méndez; Marco Carlo Passarotti; Giovanni Moretti; Francesco Mambrini; Matteo Pellegrini

16 Mar 2026 (modified: 19 May 2026), SwissText 2026 Conference Submission, CC BY 4.0

Track: Corpus Track

Keywords: Linguistic Linked Open Data, Etymology, Latin, Wiktionary

TL;DR: We present a curated resource of Latin etymological chains automatically extracted from Wiktionary, enriched with links to the LiLa Knowledge Base of Latin and modelled as RDF triples using the LemonEty ontology.

Abstract: We present a curated resource of Latin etymological chains automatically extracted from Wiktionary, enriched with links to the LiLa Knowledge Base of Latin and modelled as RDF triples using the LemonEty ontology. We also present the Python pipeline used to generate the data, which can be reused to extract Wiktionary’s etymologies for other languages. The etymological chains cover Latin words and their attested or reconstructed ancestors in languages such as Proto-Indo-European, Proto-Italic, Ancient Greek, Hebrew, and Egyptian. To address the structural noise and editorial heterogeneity of Wiktionary etymology data, we introduced strong rule-based filters throughout the pipeline, especially in the curation stage. After validation, the resulting dataset contains 9,684 curated etymological chains, which can support research in historical linguistics and computational etymology, as well as language learning, among other applications.

Submission Number: 22
