Comparison of Current Approaches to Lemmatization: A Case Study in Estonian

Published: 20 Mar 2023, Last Modified: 29 Aug 2024
Venue: NoDaLiDa 2023
Readers: Everyone
Keywords: lemmatization, deep learning, classification, pattern-based, generative, rule-based, morphology, Estonian, case study, comparison, character-level, Universal Dependencies, token classification, transformers, Hugging Face
TL;DR: We compare three distinct approaches to lemmatization for Estonian and report the findings.
Abstract: This study evaluates three different approaches to lemmatization for Estonian: Generative character-level models, Pattern-based word-level classification models, and rule-based morphological analysis. According to our experiments, a significantly smaller Generative model consistently outperforms the Pattern-based classification model based on EstBERT. Additionally, we observe a relatively small overlap in the errors made by the three models, indicating that an ensemble of the different approaches could lead to improvements.
Student Paper: Yes, the first author is a student
Community Implementations: [1 code implementation](https://www.catalyzex.com/paper/comparison-of-current-approaches-to/code)