Gos 2: A New Reference Corpus of Spoken Slovenian

Darinka Verdonik, Kaja Dobrovoljc, Tomaz Erjavec, Nikola Ljubesic

Published: 2024, Last Modified: 24 Apr 2025LREC/COLING 2024EveryoneRevisionsBibTeXCC BY-SA 4.0

Abstract: This paper introduces a new version of the Gos reference corpus of spoken Slovenian, which was recently extended to more than double the original size (300 hours, 2.4 million words) by adding speech recordings and transcriptions from two related initiatives, the Gos VideoLectures corpus of public academic speech, and the Artur speech recognition database. We describe this process by first presenting the criteria guiding the balanced selection of the newly added data and the challenges encountered when merging language resources with divergent designs, followed by the presentation of other major enhancements of the new Gos corpus, such as improvements in lemmatization and morphosyntactic annotation, word-level speech alignment, a new XML schema and the development of a specialized online concordancer.