Abstract: GRAC is a large reference corpus of Ukrainian. Some of its texts come from OCR-ed documents and contain OCR errors that must be corrected before the texts are included in the corpus. Ukrainian texts also contain foreign-language (mostly Russian) fragments that need to be identified and then either excised or left in place, depending on their length. To address these two issues, the GRAC team has developed algorithms and tools and applied them to the corpus texts. The paper discusses the specifics of these tools and outlines the potential for further development. The tools have been made available online to the NLP and CL community.
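The abstract does not describe how the GRAC tools detect foreign-language fragments, so the following is only an illustrative sketch of the general idea and not the team's method. It assumes a simple heuristic: letters that exist in the Ukrainian alphabet but not the Russian one (і, ї, є, ґ) and vice versa (ы, э, ъ, ё) are counted per sentence. The names `guess_language`, `filter_fragments`, and `MAX_KEPT_WORDS`, the threshold value, and the choice to keep short fragments and excise long ones are all assumptions made for the example.

```python
import re

# Letters found in one alphabet but not in the other.
UKRAINIAN_ONLY = set("іїєґІЇЄҐ")
RUSSIAN_ONLY = set("ыэъёЫЭЪЁ")

# Hypothetical threshold: Russian fragments of at most this many words stay in place.
MAX_KEPT_WORDS = 5


def guess_language(sentence: str) -> str:
    """Return 'uk', 'ru', or 'unknown' from counts of distinctive letters."""
    uk = sum(ch in UKRAINIAN_ONLY for ch in sentence)
    ru = sum(ch in RUSSIAN_ONLY for ch in sentence)
    if uk > ru:
        return "uk"
    if ru > uk:
        return "ru"
    return "unknown"


def filter_fragments(text: str) -> str:
    """Drop long sentences guessed to be Russian; keep everything else."""
    # Naive sentence split on terminal punctuation followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    kept = []
    for sentence in sentences:
        is_long_russian = (
            guess_language(sentence) == "ru"
            and len(sentence.split()) > MAX_KEPT_WORDS
        )
        if not is_long_russian:
            kept.append(sentence)
    return " ".join(kept)


sample = (
    "Вона усміхнулася. Это мы. "
    "Потом мы долго шли по этой улице и говорили о многом. Він пішов."
)
print(filter_fragments(sample))
```

In this sample the two-word Russian sentence is left in place and the eleven-word one is removed. A letter-based heuristic like this cannot classify sentences that contain none of the distinctive letters, so a production tool would likely need dictionary or model-based evidence as well.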