Abstract: The translation quality of Neural Machine Translation (NMT) systems depends strongly on
the training data size. Sufficient amounts of parallel data are, however, not available for many language
pairs. This paper presents a corpus augmentation method with two variants: one applicable to any
language pair, and one specific to the Chinese-Japanese language pair. The method uses both source
and target sentences of the existing parallel corpus and generates multiple pseudo-parallel sentence
pairs from a long parallel sentence pair containing punctuation marks as follows: (1) split the sentence
pair into parallel partial sentences; (2) back-translate the target partial sentences; and (3) replace each
partial sentence in the source sentence with the back-translated target partial sentence to generate
pseudo-source sentences. The word alignment information used to determine the split points
is adjusted using the “shared Chinese character rates” of segments of the sentence pairs. Experimental
results on Japanese-Chinese and Chinese-Japanese translation with ASPEC-JC (Asian Scientific Paper
Excerpt Corpus, Japanese-Chinese) show that the method substantially improves translation performance.
We also supply code (see Supplementary Materials) that reproduces the proposed method.
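
The three steps of the method can be illustrated with a minimal Python sketch. This is not the authors' released code. It makes two simplifying assumptions: split points are taken at every punctuation mark in a hypothetical `SPLIT_PUNCT` set, and the partial sentences are treated as parallel only when both sides yield the same number of segments, whereas the paper determines split points from word alignment. The `back_translate` argument is a placeholder for any target-to-source translation function, and the names `split_on_punctuation` and `augment` are illustrative only.

```python
import re
from typing import Callable, List, Tuple

# Assumption: punctuation marks treated as candidate split points.
SPLIT_PUNCT = "，、。；,;"
_SEGMENT_RE = re.compile(
    rf"[^{re.escape(SPLIT_PUNCT)}]+[{re.escape(SPLIT_PUNCT)}]?"
)


def split_on_punctuation(sentence: str) -> List[str]:
    """Split a sentence after each punctuation mark, keeping the mark."""
    return [seg for seg in _SEGMENT_RE.findall(sentence) if seg.strip()]


def augment(
    source: str,
    target: str,
    back_translate: Callable[[str], str],
) -> List[Tuple[str, str]]:
    """Generate pseudo-parallel pairs from one long parallel sentence pair."""
    # Step 1: split both sides into partial sentences.
    src_parts = split_on_punctuation(source)
    tgt_parts = split_on_punctuation(target)

    # Simplification: require equal segment counts instead of using
    # word alignment to decide which split points are parallel.
    if len(src_parts) != len(tgt_parts) or len(src_parts) < 2:
        return []

    pairs = []
    for i, tgt_part in enumerate(tgt_parts):
        # Step 2: back-translate one target partial sentence.
        pseudo_part = back_translate(tgt_part)
        # Step 3: substitute it for the matching source partial sentence.
        pseudo_parts = src_parts.copy()
        pseudo_parts[i] = pseudo_part
        pairs.append(("".join(pseudo_parts), target))
    return pairs
```

Under these assumptions, a sentence pair with n parallel partial sentences yields n pseudo-source sentences, each paired with the unchanged target sentence. In practice `back_translate` would wrap a trained target-to-source NMT model.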