Abstract: Unlike in Western languages, word segmentation is necessary for Japanese because sentences are written without word boundaries. Existing morphological analyzers for Japanese achieve very high accuracy. However, segmenting sentences written mostly in Hiragana, a Japanese writing system simpler than Kanji, is difficult because such sentences provide fewer clues for segmentation. In this study, we built word segmentation models for Hiragana sentences using two types of BERT: a unigram model and a bigram model. We pre-trained the BERT models on Wikipedia and fine-tuned them for word segmentation on the core data of the Balanced Corpus of Contemporary Written Japanese. In addition to the two BERT-based word segmentation systems, we developed a word segmentation system for Hiragana sentences using KyTea, a toolkit for text analysis with a focus on languages requiring word segmentation. We compared the three systems on word segmentation of Hiragana sentences. The experiments revealed that the unigram BERT-based system outperformed both the bigram BERT-based system and the KyTea-based system.
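Fine-tuning BERT for word segmentation is commonly framed as character-level tagging, where each character is labeled as beginning a word ("B") or continuing one ("I"). The abstract does not spell out the labeling scheme, so the following is only an illustrative sketch of that common B/I formulation, with hypothetical helper names; it is not the authors' code.

```python
# Word segmentation as character-level B/I tagging (illustrative sketch).
# to_bi_labels: build training labels from a gold-segmented sentence.
# segment: recover words from characters and predicted labels.

def to_bi_labels(words):
    """Convert a list of words into per-character B/I labels."""
    labels = []
    for word in words:
        labels.append("B")                 # first character starts a word
        labels.extend("I" * (len(word) - 1))  # remaining characters continue it
    return labels

def segment(chars, labels):
    """Reassemble words from characters and B/I labels."""
    words = []
    for ch, lab in zip(chars, labels):
        if lab == "B" or not words:
            words.append(ch)               # start a new word
        else:
            words[-1] += ch                # extend the current word
    return words

# Example: the Hiragana sentence きょうはいいてんき
# segmented as きょう / は / いい / てんき
gold = ["きょう", "は", "いい", "てんき"]
labels = to_bi_labels(gold)
print(labels)   # ['B', 'I', 'I', 'B', 'B', 'I', 'B', 'I', 'I']
print(segment(list("".join(gold)), labels))
```

A tagging model (BERT with a token-classification head, or KyTea's pointwise classifiers) would predict these labels per character; the round trip above shows that the B/I sequence fully determines the segmentation.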