Development of a Universal Dependencies treebank for Welsh

Published: 30 Jun 2019, Last Modified: 19 Feb 2025, Venue: Celtic Language Technology Workshop 2019, Readers: Everyone, License: CC BY 4.0
Abstract: This paper describes the development of the first syntactically annotated corpus of Welsh within the Universal Dependencies (UD) project. We explain how the corpus was prepared and discuss some Welsh-specific constructions that require attention. The treebank currently contains 10 756 tokens. A 10-fold cross-validation shows that results for both tagging and dependency parsing are similar to those for other treebanks of comparable size, notably the other Celtic language treebanks within the UD project.