WCC-JC 2.0: A Web-Crawled and Manually Aligned Parallel Corpus for Japanese-Chinese Neural Machine Translation
Abstract: Movie and TV subtitles are frequently employed in natural language processing (NLP)
applications, but few Japanese-Chinese bilingual corpora are available for training
neural machine translation (NMT) models. In our previous study, we constructed a sizable
Japanese-Chinese bilingual corpus by collecting subtitle text data from websites that host
movies and television series. The unsatisfactory translation
performance of the initial corpus, Web-Crawled Corpus of Japanese and Chinese (WCC-JC 1.0), was
predominantly caused by the limited number of sentence pairs. To address this shortcoming, we
thoroughly analyzed the issues associated with the construction of WCC-JC 1.0 and constructed the
WCC-JC 2.0 corpus by first collecting subtitle data from movie and TV series websites. Then, we
manually aligned a large number of high-quality sentence pairs. Our efforts resulted in a new corpus
that includes about 1.4 million sentence pairs, an 87% increase compared with WCC-JC 1.0. As a
result, WCC-JC 2.0 is now among the largest publicly available Japanese-Chinese bilingual corpora
in the world. To assess the performance of WCC-JC 2.0, we calculated the BLEU scores relative to
other comparative corpora and performed manual evaluations of the translation results generated by
translation models trained on WCC-JC 2.0. We provide WCC-JC 2.0 as a free download for research
purposes only.
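As a rough illustration of the automatic evaluation mentioned above, the sketch below re-implements corpus-level BLEU (n-gram precisions up to 4-grams, combined with a brevity penalty). This is a simplified stand-in, not the exact toolkit the authors used; published NMT results typically rely on sacreBLEU or Moses' multi-bleu.perl, and the token lists here are hypothetical examples.

```python
# Minimal corpus-level BLEU sketch (up to 4-grams, with brevity penalty).
# Assumes one reference per hypothesis and pre-tokenized input; real
# evaluations should use a standard toolkit such as sacreBLEU instead.
import math
from collections import Counter

def ngrams(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def corpus_bleu(hypotheses, references, max_n=4):
    """hypotheses, references: parallel lists of token lists."""
    clipped = [0] * max_n  # n-gram matches, clipped to reference counts
    totals = [0] * max_n   # total hypothesis n-grams
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        hyp_len += len(hyp)
        ref_len += len(ref)
        for n in range(1, max_n + 1):
            h, r = ngrams(hyp, n), ngrams(ref, n)
            clipped[n - 1] += sum(min(c, r[g]) for g, c in h.items())
            totals[n - 1] += sum(h.values())
    if min(clipped) == 0:
        return 0.0  # some n-gram order has no matches at all
    log_prec = sum(math.log(c / t) for c, t in zip(clipped, totals)) / max_n
    # Brevity penalty: punish hypotheses shorter than the references.
    bp = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return bp * math.exp(log_prec)
```

A perfect hypothesis scores 1.0, while a shorter hypothesis with fully matching n-grams is discounted only by the brevity penalty, which is why BLEU comparisons across corpora (as in the abstract) must hold tokenization and references fixed.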