Quasi-Parallel Corpora for Less-Resourced Languages: Parallelized Translations of Plato's Faidon in Basque and Finnish

University of Eastern Finland DRDHum 2024 Conference Submission 68

Published: 03 Jun 2024, Last Modified: 03 Jun 2024, DRDHum 2024, CC BY 4.0
Keywords: Parallel Corpora, less-resourced languages, Basque, Finnish, Plato, translation, Text Alignment, Universal Dependencies, Annotation.
Abstract: The European Language Equality program aims, among other things, to narrow the technological gap between English and the other European languages (European Parliament. Directorate General for Parliamentary Research Services., 2017; Aldabe et al., 2023; Gaspari et al., 2023). In this spirit, we present what is, to our knowledge, the first aligned corpus of Basque and Finnish, two non-Indo-European languages of Europe. It can be seen as a forerunner of a larger, still desired project: a multilingual aligned corpus comprising all the non-Indo-European languages of Europe, to serve both contrastive linguistic studies and as a testbed for shared strategies and approaches in language technology, given typological convergences such as their postpositional nature and their rich morphology. This work presents a feasible and inexpensive path to building such a corpus by somewhat blurring the sharp distinction between comparable and parallel corpora (McEnery & Xiao, 2018), and it coins the term “quasi-parallel” to describe the parallelization of already available translations of a common (possibly classical) omega source. Finally, this work goes through all four stages of building a corpus: a) conversion from print to machine-readable text, b) standardization, which removes graphemic idiosyncrasies to facilitate the next two steps, c) alignment, and d) automatic annotation following Universal Dependencies (de Marneffe et al., 2021). Besides the corrected text, the outcome of the first stage is a dictionary of regular errors for the particular typeface; the products of the second stage are, on the one hand, a dictionary mapping words from the original spelling to the standard one and, on the other, a language-independent program written in Python that performs this substitution over the original text and outputs the standardized version.
The alignment raises the question of how to go beyond algorithms based on word counts and punctuation; here it has been performed on the basis of meaning. The annotation is in its initial stages: it is produced automatically and must then be curated by hand.
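The dictionary-based standardization step could be sketched roughly as follows. This is a minimal illustration, not the authors' program: the function name `standardize`, the whole-word matching strategy, the handling of initial capitals, and the mapping entries are all assumptions invented for the example.

```python
import re


def standardize(text, mapping):
    """Replace whole words found in `mapping` with their standard spelling.

    `mapping` is a hypothetical dict from lowercase original spellings
    to standard spellings; words not in it are left unchanged.
    """
    def swap(match):
        word = match.group(0)
        standard = mapping.get(word.lower())
        if standard is None:
            return word
        # Assumption: keep an initial capital if the original word had one.
        if word[0].isupper():
            return standard[0].upper() + standard[1:]
        return standard

    # \w+ matches runs of Unicode word characters, so no language-specific
    # alphabet is hard-coded here.
    return re.sub(r"\w+", swap, text)


# Invented placeholder entries, not taken from the project's dictionary.
example_mapping = {"olde": "old", "shoppe": "shop"}
print(standardize("The Olde shoppe.", example_mapping))  # prints: The Old shop.
```

Keeping the spelling dictionary as data separate from the substitution routine is what would let the same code serve either language: only the mapping changes.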
Submission Number: 68