Track: Corpus Track
Keywords: Linguistic Linked Open Data, Etymology, Latin, Wiktionary
TL;DR: We present a curated resource of Latin etymological chains automatically extracted from Wiktionary, enriched with links to the LiLa Knowledge Base of Latin and modelled as RDF triples using the LemonEty ontology.
Abstract: We present a curated resource of Latin etymological chains automatically extracted from Wiktionary, enriched with links to the LiLa Knowledge Base of Latin and modelled as RDF triples using the LemonEty ontology. We also present the Python pipeline used to generate the data, which can be reused to extract Wiktionary’s etymologies for other languages. The etymological chains cover Latin words and their attested or reconstructed ancestors in languages such as Proto-Indo-European, Proto-Italic, Ancient Greek, Hebrew, and Egyptian. To address the structural noise and editorial heterogeneity of Wiktionary etymology data, we apply strict rule-based filters throughout the pipeline, especially in the curation stage. After validation, the resulting dataset contains 9,684 curated etymological chains, which can support research in Historical Linguistics and Computational Etymology, as well as applications such as language learning.
Submission Number: 22