Abstract: The Japanese language has many homographs, which are words that share the same written form regardless of their pronunciations. For example, “辛い” has two pronunciations, “karai” and “tsurai”, which mean “hot (spicy)” and “hard” or “tough”, respectively. Therefore, estimating the pronunciation of homographs is necessary to read Japanese sentences accurately. In this study, we develop a system that estimates the pronunciations of homographs using a Bidirectional Encoder Representations from Transformers (BERT) model. This research is the first attempt at pronunciation estimation for all homographs, which we achieve by applying a technique for all-words word sense disambiguation. We used the Corpus of Spontaneous Japanese (CSJ), a transcription of spoken Japanese, as the test data. As training data, in addition to CSJ, we used the non-core data of the Balanced Corpus of Contemporary Written Japanese (BCCWJ), whose pronunciations are automatically tagged by a Japanese morphological analyzer, to reduce the cost of transcription. We show that automatically tagged data from a written Japanese corpus can improve the accuracy of pronunciation estimation.