Abstract: GRAC is a large reference corpus of Ukrainian spanning over 200 years. The morphological analysis system used to annotate the corpus was originally designed only for the modern language, yet the corpus includes texts that sometimes differ significantly from modern ones. Orthographic systems different from the modern standard have been used throughout the history of the Ukrainian language, including in regional and diaspora publications. The article describes the algorithms used in the corpus to handle old orthographies and some other cases of nonstandard spelling, and discusses the prospects for their development. The developed tools have been made available online to the NLP and CL community.