Archiving Submission: No (non-archival)
Previous Venue If Non Archival: None
Keywords: morphological segmentation, subword representations, morphological analysis, morphologically-rich languages, pos tagging
TL;DR: We experiment with morphologically-guided tokenization methods for Latin, finding that they improve downstream POS and morphological feature tagging performance compared to baseline tokenizers.
Abstract: Tokenization is a critical component of language model pretraining, yet standard tokenization methods often prioritize information-theoretic goals like high compression and low fertility rather than linguistic goals like morphological alignment. In fact, they have been shown to be suboptimal for morphologically rich languages, where tokenization quality directly impacts downstream performance. In this work, we investigate morphologically-aware tokenization for Latin, a morphologically rich, medium-resource language. For both the standard WordPiece and Unigram Language Model (ULM) tokenization models, we propose two variations: one that seeds the tokenizer vocabulary with known morphological suffixes, and another that uses contextual pre-tokenization with a language-specific, lexicon-based morphological analyzer. With each learned tokenizer, we pretrain a Latin BERT model and evaluate its performance on POS and morphological feature classification. We find that morphologically-guided tokenization improves overall performance (e.g., 36% relative error reduction for morphological feature accuracy), with particularly large gains for specific, morphologically-signalled features (e.g., 54% relative error reduction for tense prediction). Our results highlight the utility of morphological linguistic resources for improving language modeling for morphologically complex languages.
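For concreteness, the suffix-seeding variant could look something like the following sketch built on the HuggingFace `tokenizers` library. The suffix list, the corpus path `latin_corpus.txt`, and the post-training vocabulary injection step are illustrative assumptions, not the paper's actual procedure.

```python
# Sketch: seeding a WordPiece vocabulary with known Latin suffixes.
# Assumptions: the suffix list, corpus file, and post-hoc injection
# step are illustrative, not the paper's exact setup.
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# Hypothetical list of Latin inflectional suffixes, written as WordPiece
# continuation pieces (the "##" prefix marks non-initial subwords).
LATIN_SUFFIXES = ["##ibus", "##orum", "##arum", "##erunt", "##isse", "##ntur"]

tokenizer = Tokenizer(models.WordPiece(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

trainer = trainers.WordPieceTrainer(
    vocab_size=32_000,
    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],
)
tokenizer.train(files=["latin_corpus.txt"], trainer=trainer)  # hypothetical corpus

# Inject any suffixes the trainer did not learn on its own, then rebuild
# the model so segmentation uses the merged vocabulary.
vocab = tokenizer.get_vocab()
for suffix in LATIN_SUFFIXES:
    vocab.setdefault(suffix, len(vocab))
tokenizer.model = models.WordPiece(vocab, unk_token="[UNK]")

print(tokenizer.encode("laudaverunt").tokens)
```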
Submission Number: 30