SYN2020: A New Corpus of Czech with an Innovated Annotation

Published: 01 Jan 2021; Last Modified: 14 Mar 2024; TDS 2021; Readers: Everyone
Abstract: The paper introduces the SYN2020 corpus, a newly released representative corpus of written Czech that follows the tradition of the Czech National Corpus SYN series. The design of SYN2020 incorporates several substantial new features in segmentation, lemmatization, and morphological tagging, such as a new treatment of lemma variants, a new system for identifying the morphological categories of verbs, and a new treatment of multiword tokens. The annotation process, including the data and tools used, is described, and the accuracy of the annotation is discussed.