Is there a core vocabulary for Czech? Introducing the Czech General Service List

27 Jun 2023 | OpenReview Archive Direct Upload | Readers: Everyone
Abstract: This study presents the Czech General Service List (CGSL), designed to capture the core vocabulary of written and spoken Czech that is useful to learners of Czech as a second language (CSLLs). The CGSL is the result of a robust comparison of five Czech language corpora (SYN2020, csTenTen17, Koditex, ORALv1, and ORTOFONv2) that together contain over 12 billion running words. These five corpora represent a variety of corpus sizes, designs, and text types of both written and spoken Czech. The study investigates the overlap between the top 10,000 words in these corpora, ranked by normalized average reduced frequency (ARFn), a measure that takes both frequency and dispersion into account. It also investigates the overlap and rank correlation between words from the written and the spoken corpora. Significant differences were found between words used in written and spoken Czech, so the CGSL was built to contain three types of words: 1) core words of Czech, 2) core words of written Czech, and 3) core words of spoken Czech. Finally, the study compared the words on the CGSL with words on pedagogical wordlists from Czech textbooks designed for L1 English-speaking CSLLs and found significant differences between the two. This suggests that future Czech as a second language (CSL) materials informed by the CGSL might affect Czech learning differently than existing CSL materials.
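To illustrate how a measure can combine frequency and dispersion, the sketch below computes plain average reduced frequency (ARF) under its commonly cited definition: the corpus is treated as circular, the gaps between successive occurrences of a word are capped at the average gap v = N / f, and the capped gaps are summed and divided by v. This is an illustrative assumption and not the authors' pipeline; the abstract does not describe the normalization step behind ARFn, so it is omitted here. The function names `average_reduced_frequency` and `top_k_overlap` are hypothetical.

```python
def average_reduced_frequency(positions, corpus_size):
    """Plain ARF for one word, given the 0-based token positions where it occurs.

    Assumes the commonly cited definition: ARF = (1/v) * sum(min(d_i, v)),
    with v = corpus_size / frequency and d_i the cyclic gaps between occurrences.
    """
    f = len(positions)
    if f == 0:
        return 0.0
    v = corpus_size / f
    positions = sorted(positions)
    # Cyclic gaps: the gap before the first occurrence wraps around from the last one.
    gaps = [positions[0] + corpus_size - positions[-1]]
    gaps += [b - a for a, b in zip(positions, positions[1:])]
    return sum(min(d, v) for d in gaps) / v


def top_k_overlap(scores_a, scores_b, k):
    """Share of the top-k words (by score) that two word->score mappings have in common."""
    top_a = set(sorted(scores_a, key=scores_a.get, reverse=True)[:k])
    top_b = set(sorted(scores_b, key=scores_b.get, reverse=True)[:k])
    return len(top_a & top_b) / k


# Toy data (hypothetical): two words with the same raw frequency in a 100-token corpus.
even = average_reduced_frequency([0, 25, 50, 75], 100)   # evenly spread
clustered = average_reduced_frequency([0, 1, 2, 3], 100)  # bunched together
overlap = top_k_overlap({"a": 3, "b": 2, "c": 1}, {"a": 3, "c": 2, "b": 1}, 2)
```

In this toy example both words occur four times, but the evenly spread word keeps an ARF equal to its raw frequency, while the clustered word's ARF drops close to 1, which is why ranking by ARF favors words used throughout a corpus over words concentrated in a few texts.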