Abstract: Lemmatization and morphological tagging are indispensable steps in Slovak corpus
linguistics. In this article, we evaluate two state-of-the-art Slovak lemmatizers and
morphosyntactic description (MSD) taggers, one based on MorphoDiTa and the other on spaCy.
We measured accuracy on the test subset of a manually lemmatized and MSD-annotated corpus and
found that the combination of lemma and tag achieved 93.5% accuracy with MorphoDiTa and 95.6%
with spaCy.
Most errors occurred in disambiguating MSD tags for homonymous uninflected parts of
speech, such as particles, conjunctions, and adverbs, and in distinguishing the nominative from
the accusative in singular masculine inanimate forms. In these cases, spaCy shows a noticeable
improvement over MorphoDiTa, likely because it makes better use of word context.
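
The combined lemma and tag accuracy reported above can be read as the share of tokens for which both the lemma and the MSD tag match the gold annotation. A minimal sketch of that computation follows, assuming gold and predicted tokens are already aligned one to one; the function name, the sample tokens, and the tag strings are illustrative placeholders, not data from the evaluated corpus.

```python
from typing import Sequence, Tuple

# A token is represented as a (lemma, MSD tag) pair.
Token = Tuple[str, str]


def joint_accuracy(gold: Sequence[Token], pred: Sequence[Token]) -> float:
    """Fraction of tokens whose lemma AND tag both match the gold annotation."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must contain the same number of tokens")
    if not gold:
        raise ValueError("cannot compute accuracy on an empty sequence")
    # Tuple equality requires both the lemma and the tag to agree.
    correct = sum(1 for g, p in zip(gold, pred) if g == p)
    return correct / len(gold)


# Hypothetical toy data: the first token has the right lemma but the wrong
# case in its tag (nominative vs. accusative), so it counts as an error.
gold = [("hrad", "SSis1"), ("byť", "VKesc+"), ("starý", "AAis1x"), ("a", "O")]
pred = [("hrad", "SSis4"), ("byť", "VKesc+"), ("starý", "AAis1x"), ("a", "O")]

score = joint_accuracy(gold, pred)
print(f"joint lemma+tag accuracy: {score:.1%}")
```

Under this definition a token with a correct lemma but a wrong tag is a full error, which is why the combined score cannot exceed the accuracy of either component alone.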