A Cross Search Method for Data Augmentation in Neural Machine Translation

Published: 01 Jan 2024 · Last Modified: 25 Jul 2025 · ICASSP 2024 · CC BY-SA 4.0
Abstract: Large language models (LLMs) have shown excellent performance on general machine translation. However, they suffer from high deployment costs and unsatisfactory quality in low-resource domains. To this end, we explore building base translation models with LLM-enhanced data augmentation. For data augmentation, we propose a cross search method to obtain a high-quality parallel in-domain corpus. The method comprises two distinct approaches: antagony-cross search and similarity-cross search. Antagony-cross search generates monolingual data closely aligned with the target domain by employing token-level control. Similarity-cross search preserves the alignment between source and target sentences through a similarity score in back translation, so that the generated target-language text stays semantically close to the source. With the proposed method, we generate millions of high-quality in-domain parallel sentence pairs from low-resource monolingual data. Our method achieves improvements of approximately 0.5-4 BLEU points in these domains.
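The abstract does not give implementation details for similarity-cross search; the following is a minimal sketch of how similarity-based filtering of back-translated pairs might look. The choice of encoder (LaBSE here) and the threshold value are assumptions for illustration, not the paper's specification.

```python
# Hypothetical sketch of similarity-cross search filtering.
# Assumes a multilingual sentence encoder (LaBSE) as the similarity
# scorer; the paper does not specify which model or threshold it uses.
from sentence_transformers import SentenceTransformer, util


def filter_back_translations(src_sentences, bt_sentences, threshold=0.8):
    """Keep (source, back-translated target) pairs whose cross-lingual
    cosine similarity meets the threshold, discarding misaligned pairs."""
    model = SentenceTransformer("sentence-transformers/LaBSE")
    src_emb = model.encode(src_sentences, convert_to_tensor=True)
    bt_emb = model.encode(bt_sentences, convert_to_tensor=True)
    kept = []
    for i, (src, bt) in enumerate(zip(src_sentences, bt_sentences)):
        score = util.cos_sim(src_emb[i], bt_emb[i]).item()
        if score >= threshold:
            kept.append((src, bt, score))
    return kept


if __name__ == "__main__":
    sources = ["The patient was given 5 mg of the drug daily."]
    back_translations = ["Der Patient erhielt täglich 5 mg des Medikaments."]
    for src, tgt, score in filter_back_translations(sources, back_translations):
        print(f"{score:.3f}\t{src}\t{tgt}")
```

In this reading, pairs scoring below the threshold are dropped rather than repaired, so the filter trades corpus size for semantic alignment between source sentences and their back-translated counterparts.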