Impact of Tokenization on Language Models: An Analysis for Turkish

Anonymous

16 Nov 2021 (modified: 05 May 2023) · ACL ARR 2021 November Blind Submission
Abstract: Tokenization is an important text preprocessing step that prepares input tokens for language models. WordPiece and BPE are the de facto methods employed by large language models such as BERT and GPT. However, the impact of tokenization can differ for agglutinative languages, such as the Turkic languages, whose words are built from roots with prefixes and suffixes. We compare five tokenization methods, including a morphological-level tokenization that takes the agglutinative language structure into account. We train tokenizers and pre-train mini language models with the RoBERTa pre-training procedure on the Turkish portion of the OSCAR corpus. We then fine-tune our models on six downstream tasks. There are two main outcomes: (i) morphological and word-level tokenizers outperform the de facto tokenizers in particular cases; (ii) mini models can be competitive with larger state-of-the-art models, such that a 14-times smaller model recovers 94% of the performance of a larger model.
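As a rough illustration of the first stage of such a pipeline, the sketch below trains a subword (BPE) tokenizer on a Turkish text dump and inspects how it segments an agglutinative word. It is a minimal example assuming the Hugging Face `tokenizers` library; the file path `turkish_oscar.txt`, the vocabulary size, and the special tokens are placeholder choices, not the paper's actual configuration.

```python
# Illustrative sketch: training a BPE tokenizer on a Turkish corpus with the
# Hugging Face `tokenizers` library. Paths and hyperparameters are assumptions.
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace

# Byte-pair-encoding model with an unknown-token fallback.
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

# Vocabulary size and special tokens are hypothetical choices for illustration.
trainer = BpeTrainer(
    vocab_size=32_000,
    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],
)

# `turkish_oscar.txt` is a placeholder path to a plain-text dump of the corpus.
tokenizer.train(files=["turkish_oscar.txt"], trainer=trainer)
tokenizer.save("bpe-turkish.json")

# Inspect how an agglutinative Turkish word is split into subword units.
print(tokenizer.encode("evlerimizden").tokens)
```

A morphological-level tokenizer would instead segment along morpheme boundaries (e.g. root plus suffixes), which is the kind of alternative the paper compares against BPE and WordPiece.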